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Preface 

Genetic programming (GP) is a collection of evolutionary computation tech- 
niques that allow computers to solve problems automatically. Since its in- 
ception twenty years ago, GP has been used to solve a wide range of prac- 
tical problems, producing a number of human-competitive results and even 
patentable new inventions. Like many other areas of computer science, GP 
is evolving rapidly, with new ideas, techniques and applications being con- 
stantly proposed. While this shows how wonderfully prolific GP is, it also 
makes it difficult for newcomers to become acquainted with the main ideas 
in the field, and form a mental map of its different branches. Even for people 
who have been interested in GP for a while, it is difficult to keep up with 
the pace of new developments. 

Many books have been written which describe aspects of GP. Some 
provide general introductions to the field as a whole. However, no new 
introductory book on GP has been produced in the last decade, and anyone 
wanting to learn about GP is forced to map the terrain painfully on their 
own. This book attempts to fill that gap, by providing a modern field guide 
to GP for both newcomers and old-timers. 

It would have been straightforward to find a traditional publisher for such 
a book. However, we want our book to be as accessible as possible to every- 
one interested in learning about GP. Therefore, we have chosen to make it 
freely available on-line, while also allowing printed copies to be ordered in- 
expensively from http://lulu.coin Visit http://www.gp-field-guide. 
org.uk for the details. 

The book has undergone numerous iterations and revisions. It began as 
a book-chapter overview of GP (more on this below), which quickly grew 
to almost 100 pages. A technical report version of it was circulated on the 
GP mailing list. People responded very positively, and some encouraged us 
to continue and expand that survey into a book. We took their advice and 
this field guide is the result. 
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A Compendium and to be pub- 



What’s in this book 



The book is divided up into four parts. 

Part [I] covers the basics of genetic programming (GP). This starts with a 
gentle introduction which describes how a population of programs is stored 
in the computer so that they can evolve with time. We explain how programs 
are represented, how random programs are initially created, and how GP 
creates a new generation by mutating the better existing programs or com- 
bining pairs of good parent programs to produce offspring programs. This 
is followed by a simple explanation of how to apply GP and an illustrative 
example of using GP. 

In Part [TTJ we describe a variety of alternative representations for pro- 
grams and some advanced GP techniques. These include: the evolution of 
machine-code and parallel programs, the use of grammars and probability 
distributions for the generation of programs, variants of GP which allow the 
solution of problems with multiple objectives, many speed-up techniques 
and some useful theoretical tools. 

Part |III| provides valuable information for anyone interested in using GP 
in practical applications. To illustrate genetic programming’s scope, this 
part contains a review of many real-world applications of GP. These in- 
clude: curve fitting, data modelling, symbolic regression, image analysis, 
signal processing, financial trading, time series prediction, economic mod- 
elling, industrial process control, medicine, biology, bioinformatics, hyper- 
heuristics, artistic applications, computer games, entertainment, compres- 
sion and human-competitive results. This is followed by a series of recom- 
mendations and suggestions to obtain the most from a GP system. We then 
provide some conclusions. 

Part |IV| completes the book. In addition to a bibliography and an index, 
this part includes two appendices that provide many pointers to resources, 
further reading and a simple GP implementation in Java. 
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Chapter 1 

Introduction 



The goal of having computers automatically solve problems is central to 
artificial intelligence, machine learning, and the broad area encompassed by 




fields of machine learning and artificial intelligence is: 



“to get machines to exhibit behaviour, which if done by humans, 
would be assumed to involve the use of intelligence.” 

Genetic programming (GP) is an evolutionary computation (ECj] tech- 
nique that automatically solves problems without requiring the user to know 
or specify the form or structure of the solution in advance. At the most 
abstract level GP is a systematic, domain-independent method for getting 
computers to solve problems automatically starting from a high-level state- 
ment of what needs to be done. 

Since its inception, GP has attracted the interest of myriads of people 
around the globe. This book gives an overview of the basics of GP, sum- 
marised important work that gave direction and impetus to the field and 
discusses some interesting new directions and applications. Things continue 
to change rapidly in genetic programming as investigators and practitioners 
discover new methods and applications. This makes it impossible to cover 
all aspects of GP, and this book should be seen as a snapshot of a particular 
moment in the history of the field. 



1 These are also known as evolutionary algorithms or EAs. 
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1 Introduction 




Figure 1.1: The basic control flow for genetic programming, where survival 
of the fittest is used to find solutions. 



1.1 Genetic Programming in a Nutshell 



In genetic programming we evolve a population of computer programs. That 
is, generation by generation, GP stochastically transforms populations of 
programs into new, hopefully better, populations of programs, cf. Figure [TT| 
GP, like nature, is a random process, and it can never guarantee results. 
GP’s essential randomness, however, can lead it to escape traps which de- 
terministic methods may be captured by. Like nature, GP has been very 
successful at evolving novel and unexpected ways of solving problems. (See 
Chapter 12 for numerous examples.) 

The basic steps in a GP system are shown in Algorithm |l.l| GP finds out 
how well a program works by running it, and then comparing its behaviour 
to some ideal (line 3). We might be interested, for example, in how well a 
program predicts a time series or controls an industrial process. This com- 
parison is quantified to give a numeric value called fitness. Those programs 
that do well are chosen to breed (line 4) and produce new programs for the 
next generation (line 5). The primary genetic operations that are used to 
create new programs from existing ones are: 



• Crossover: The creation of a child program by combining randomly 
chosen parts from two selected parent programs. 

• Mutation: The creation of a new child program by randomly altering 
a randomly chosen part of a selected parent program. 



1.2 Getting Started 

Two key questions for those first exploring GP are: 

1. What should I read to get started in GP? 

2. Should I implement my own GP system or should I use an existing 
package? If so, what package should I use? 



1.3 Prerequisites 
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l: Randomly create an initial population of programs from the available 



primitives (more on this in Section 2.2). 
repeat 

Execute each program and ascertain its fitness. 

Select one or two program(s) from the population with a probability 
based on fitness to participate in genetic operations (Section [273] ) . 
Create new individual program(s) by applying genetic operations with 
specified probabilities (Section |iO| . 

until an acceptable solution is found or some other stopping condition 
is met (e.g., a maximum number of generations is reached), 
return the best-so-far individual. 



Algorithm 1.1: Genetic Programming 



The best way to begin is obviously by reading this book, so you’re off to 
a good start. We included a wide variety of references to help guide people 
through at least some of the literature. No single work, however, could claim 
to be completely comprehensive. Thus Appendix [A] reviews a whole host of 
books, videos, journals, conferences, and on-line sources (including several 
freely available GP systems) that should be of assistance. 

We strongly encourage doing GP as well as reading about it; the dy- 
namics of evolutionary algorithms are complex, and the experience of trac- 
ing through runs is invaluable. In Appendix [B] we provide the full Java 
implementation of Riccardo’s TinyGP system. 



1.3 Prerequisites 



Although this book has been written with beginners in mind, unavoidably 
we had to make some assumptions about the typical background of our 
readers. The book assumes some working knowledge of computer science 
and computer programming; this is probably an essential prerequisite to get 
the most from the book. 



We don’t expect that readers will have been exposed to other flavours of 
evolutionary algorithms before, although a little background might be useful. 
The interested novice can easily find additional information on evolutionary 
computation thanks to the plethora of tutorials available on the Internet. 
Articles from Wikipedia and the genetic algorithm tutorial produced by 



Whitley (1994) should suffice. 
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1 Introduction 



1.4 Overview of this Field Guide 



As we indicated in the section entitled “What’s in this book” (page[v]), the 
book is divided up into four parts. In this section, we will have a closer look 
at their content. 

Part [T] is mainly for the benefit of beginners, so notions are introduced 
at a relaxed pace. In the next chapter we provide a description of the key 
elements in GP. These include how programs are stored (Section |2.1|) , the 
initialisation of the population (Section |2.2[ ), the selection of individuals 
(Section |2.3[ ) and the genetic operations of crossover and mutation (Sec- 
tion |2.4 ) . A discussion of the decisions that are needed before running GP 



is given in Chapter [3] These preparatory steps include the specification of 
the set of instructions that GP can use to construct programs (Sections 3.1 
and 3.2 1, the definition of a fitness measure that can guide GP towards 



good solutions (Section |3.3|), setting GP parameters (Section 3.4 1 and, fi- 



nally, the rule used to decide when to stop a GP run (Section 3.5). To help 



the reader understand these, Chapter [4] presents a step-by-step application 
of the preparatory steps (Section |4.1[ ) and a detailed explanation of a sample 
GP run (Section 4.2 1. 



After these introductory chapters, we go up a gear in Part [H] where 
we describe a variety of more advanced GP techniques. Chapter [5] consid- 
ers additional initialisation strategies and genetic operators for the main GP 
representation — syntax trees. In Chapter[6]we look at techniques for the evo- 
lution of structured and grammatically-constrained programs. In particular, 
we consider: modular and hierarchical structures including automatically de- 



fined functions and architecture-altering operations (Section 6.1), systems 



that constrain the syntax of evolved programs using grammars or type sys- 
tems (Section |6.2| ), and developmental GP (Section 6.3). In Chapter [7| we 
discuss alternative program representations, namely linear GP (Section 7.1 1 
and graph-based GP (Section [P2| ) . 

In Chapter [8] we review systems where, instead of using mutation and 
recombination to create new programs, they are simply generated randomly 
according to a probability distribution which itself evolves. These are known 

Section 18. 31 



as estimation of distribution algorithms, cf. Sections 8.1 and |8.2| 
reviews hybrids between GP and probabilistic grammars, where probability 
distributions are associated with the elements of a grammar. 

Many, if not most, real-world problems are multi-objective, in the sense 
that their solutions are required to satisfy more than one criterion at the 
same time. In Chapter [9] we review different techniques that allow GP to 
solve multi-objective problems. These include the aggregation of multiple 
objectives into a scalar fitness measure (Section |9.1| ), the use of the notion of 
Pareto dominance (Section 9.2 1, the definition of dynamic or staged fitness 
functions (Section 9.3 1, and the reliance on special biases on the genetic 
operators to aid the optimisation of multiple objectives (Section |9.4|). 



1.4 Overview of this Field Guide 
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A variety of methods to speed up, parallelise and distribute genetic pro- 
gramming runs are described in Chapter [lOj We start by looking at ways 
to reduce the number of fitness evaluations or increase their effectiveness 
(Section 10.1 1 and ways to speed up their execution (Section 10.2). We 
then point out (Section |10.3 1 that faster evaluation is not the only reason 
for running GP in parallel, as geographic distribution has advantages in 
its own right. In Section |10.4| we consider the first approach and describe 
master-slave parallel architectures (Section 10.4.1), running GP on graphics 
hardware (Section 10.4.2 ) and FPGAs (Section 10.4.3| ), and a fast method to 
exploit the parallelism available on every computer (Section 10.4.41. Finally, 
Section [10. 5| looks at the second approach discussing the geographically dis- 
tributed evolution of programs. We then give an overview of some of the 
considerable work that has been done on GP’s theory and its practical uses 
(Chapter [TTj) . 

After this review of techniques, Part |III| provides information for peo- 
ple interested in using GP in practical applications. We survey the enor- 
mous variety of applications of GP in Chapter [l2j We start with a dis- 
cussion of the general kinds of problems where GP has proved successful 
(Section 12.1) and then describe a variety of GP applications, including: 
curve fitting, data modelling and symbolic regression (Section |12.2[ ); human 
competitive results ( Section [l 2. 3 1 ); image analysis and signal processing (Sec- 
tion [l2(4|j^financial trading, time series prediction and economic modelling 
(Section |12.5 |; industrial process control (Section 12.6); medicine, biology 
and bioinformatics (Section 12.71; the evolution of search algorithms and 
optimisers (Section 12.8); computer games and entertainment applications 
(Section 12. 9|); artistic applications (12.101; and GP-based data compression 
(Section 12.11 1. This is followed by a chapter providing a collection of trou- 
bleshooting techniques used by experienced GP practitioners (Chapter 13 1 
and by our conclusions (Chapter 14 1. 

In Part |IV[ we provide a resources appendix that reviews the many 
sources of further information on GP, on its applications, and on related 
problem solving systems (Appendix [A]) . This is followed by a description 
and the source code for a simple GP system in Java (Appendix |B). The 
results of a sample run with the system are also described in the appendix 
and further illustrated via a Flip-O-Rama animatiorj^] (see Section H3 

The book ends with a large bibliography containing around 650 refer- 
ences. Of these, around 420 contain pointers to on-line versions of the corre- 
sponding papers. While this is very useful on its own, the users of the PDF 
version of this book will be able to do more if they use a PDF viewer that 
supports hyperlinks: they will be able to click on the URLs and retrieve the 
cited articles. Around 550 of the papers in the bibliography are included in 



2 This is in the footer of the odd-numbered pages in the bibliography and in the index. 
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1 Introduction 



the GP bibliography (Langdon, Gustafson, and Koza 1995-20081 ^ We have 
linked those references to the corresponding BiBJ^Xentries in the bibliog- 
raphy. Just click on the GPbib symbols to retrieve them instantaneously. 
Entries in the bibliography typically include keywords, abstracts and often 
further URLs. 

With a slight self-referential violation of bibliographic etiquette, we have 



also included in the bibliography the excellent (Poli et al. 2008) to clar- 
ify how to cite this book. UT^X users can find the BibT^X entry for 
this book at http://www.cs.bh.am.ac.uk/~wbl/biblio/gp-html/poli08_ 
f ieldguide . html 



3 Available at http://www.cs.bham.ac.uk/~wbl/biblio/ 




Part I 



Basics 




Here Alice steps through the looking glass. . . 



and the Jabberwock is slain. 
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Chapter 2 



Representation, 
Initialisation and 
Operators in Tree-based 



GP 



This chapter introduces the basic tools and terminology used in genetic 
programming. In particular, it looks at how trial solutions are represented in 
most GP systems (Section |2.1[ |, how one might construct the initial random 
population ( Section [2T2]), an d how selection (Section [2T3] ) as well as crossover 
and mutation (Section |2.4|) are used to construct new programs. 



2.1 Representation 

In GP, programs are usually expressed as syntax trees rather than as lines of 
code. For example Figure |2d~| shows the tree representation of the program 
max(x+x,x+3*y) . The variables and constants in the program (x, y and 3) 
are leaves of the tree. In GP they are called terminals , whilst the arithmetic 
operations (+, * and max) are internal nodes called functions. The sets of 
allowed functions and terminals together form the primitive set of a GP 
system. 

In more advanced forms of GP, programs can be composed of multiple 
components (e.g., subroutines). In this case the representation used in GP 
is a set of trees (one for each component) grouped together under a special 
root node that acts as glue, as illustrated in Figure |2.2| We will call these 
(sub)trees branches. The number and type of the branches in a program, 
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together with certain other features of their structure, form the architecture 
of the program. This is discussed in more detail in Section [(TT| 

It is common in the GP literature to represent expressions in a prefix no- 
tation similar to that used in Lisp or Scheme. For example, max(x+x,x+3*y) 
becomes (max (+ x x) (+ x (* 3 y))). This notation often makes it eas- 
ier to see the relationship between (sub)expressions and their corresponding 
(sub)trees. Therefore, in the following, we will use trees and their corre- 
sponding prefix-notation expressions interchangeably. 

How one implements GP trees will obviously depend a great deal on 
the programming languages and libraries being used. Languages that pro- 
vide automatic garbage collection and dynamic lists as fundamental data 
types make it easier to implement expression trees and the necessary GP 
operations. Most traditional languages used in AI research (e.g., Lisp and 
Prolog), many recent languages (e.g., Ruby and Python), and the languages 
associated with several scientific programming tools (e.g., MATLAE0 and 
Mathematical]) have these facilities. In other languages, one may have to 
implement lists/trees or use libraries that provide such data structures. 

In high performance environments, the tree-based representation of pro- 
grams may be too inefficient since it requires the storage and management 
of numerous pointers. In some cases, it may be desirable to use GP primi- 
tives which accept a variable number of arguments (a quantity we will call 
arity). An example is the sequencing instruction progn, which accepts any 
number of arguments, executes them one at a time and then returns the 




3 y 



Figure 2.1: GP syntax tree representing max(x+x,x+3*y) . 



1 MATLAB is a registered trademark of The MathWorks, Inc 
2 Mathematica is a registered trademark of Wolfram Research, Inc. 
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Figure 2.2: Multi-component program representation. 



value returned by the last argument. However, fortunately, it is now ex- 
tremely common in GP applications for all functions to have a fixed number 
of arguments. If this is the case, then, the brackets in prefix-notation ex- 
pressions are redundant, and trees can efficiently be represented as simple 
linear sequences. In effect, the function’s name gives its arity and from the 
arities the brackets can be inferred. For example, the expression (max (+ x 
x) (+ x (* 3 y) ) ) could be written unambiguously as the sequence max 
+ xx + x*3y. 

The choice of whether to use such a linear representation or an explicit 
tree representation is typically guided by questions of convenience, efficiency, 
the genetic operations being used (some may be more easily or more effi- 
ciently implemented in one representation), and other data one may wish 
to collect during runs. (It is sometimes useful to attach additional infor- 
mation to nodes, which may be easier to implement if they are explicitly 
represented) . 

These tree representations are the most common in GP, e.g., numer- 
ous high-quality, freely available GP implementations use them (see the 
resources in Appendix [A] page |148[ for more information) and so does also 
the simple GP system described in Appendix [B] However, there are other 
important representations, some of which are discussed in Chapter [TJ 



2.2 Initialising the Population 

Like in other evolutionary algorithms, in GP the individuals in the initial 
population are typically randomly generated. There are a number of dif- 
ferent approaches to generating this random initial population. Here we 
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Figure 2.3: Creation of a full tree having maximum depth 2 using the full 
initialisation method (t = time). 



will describe two of the simplest (and earliest) methods (the full and grow 
methods) , and a widely used combination of the two known as Ramped half- 
and-half. 



In both the full and grow methods, the initial individuals are generated 
so that they do not exceed a user specified maximum depth. The depth of 
a node is the number of edges that need to be traversed to reach the node 
starting from the tree’s root node (which is assumed to be at depth 0) . The 
depth of a tree is the depth of its deepest leaf (e.g., the tree in Figure pO] has 
a depth of 3). In the full method (so named because it generates full trees, 
i.e. all leaves are at the same depth) nodes are taken at random from the 
function set until the maximum tree depth is reached. (Beyond that depth, 
only terminals can be chosen.) Figure 2.3 shows a series of snapshots of the 
construction of a full tree of depth 2. The children of the * and / nodes 
must be leaves or otherwise the tree would be too deep. Thus, at both steps 
f = 3, f = 4, f = 6 and t = 7 a terminal must be chosen (x, y, 1 and 0, 
respectively). 



Although, the full method generates trees where all the leaves are at 
the same depth, this does not necessarily mean that all initial trees will 
have an identical number of nodes (often referred to as the size of a tree) 
or the same shape. This only happens, in fact, when all the functions in 
the primitive set have an equal arity. Nonetheless, even when mixed-arity 
primitive sets are used, the range of program sizes and shapes produced by 
the full method may be rather limited. The grow method, on the contrary, 
allows for the creation of trees of more varied sizes and shapes. Nodes are 
selected from the whole primitive set (i.e., functions and terminals) until 
the depth limit is reached. Once the depth limit is reached only terminals 
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t=1 t=2 t=3 




Figure 2.4: Creation of a five node tree using the grow initialisation method 
with a maximum depth of 2 (t = time). A terminal is chosen at t = 2, 
causing the left branch of the root to be closed at that point even though 
the maximum depth had not been reached. 



may be chosen (just as in the full method). Figure 2.4 illustrates this 
process for the construction of a tree with depth limit 2. Here the first 
argument of the + root node happens to be a terminal. This closes off that 
branch preventing it from growing any more before it reached the depth 
limit. The other argument is a function (-), but its arguments are forced 
to be terminals to ensure that the resulting tree does not exceed the depth 
limit. Pseudocode for a recursive implementation of both the full and grow 
methods is given in Algorithm |2.1| 

Because neither the grow or full method provide a very wide array of 
sizes or shapes on their own, Koza (1992) proposed a combination called 
ramped half-and-half. Half the initial population is constructed using full 
and half is constructed using grow. This is done using a range of depth limits 
(hence the term “ramped”) to help ensure that we generate trees having a 
variety of sizes and shapes. 



While these methods are easy to implement and use, they often make it 
difficult to control the statistical distributions of important properties such 
as the sizes and shapes of the generated trees. For example, the sizes and 
shapes of the trees generated via the grow method are highly sensitive to the 
sizes of the function and terminal sets. If, for example, one has significantly 
more terminals than functions, the grow method will almost always generate 
very short trees regardless of the depth limit. Similarly, if the number of 
functions is considerably greater than the number of terminals, then the 
grow method will behave quite similarly to the full method. The arities 
of the functions in the primitive set also influence the size and shape of the 
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procedure: gen_rnd_expr(func_set,term_set,max_d, method) 

1: if max_d = 0 or (method = grow and rand() < iiunc.set| ) 

then 

2: expr = choose _random_element( term_set ) 

3: else 

4: func = choose_random_element( func_set ) 

5: for i = 1 to arity(func) do 

6: argJ = gen_rnd_expr( func_set, term_set, max_d - 1, method ); 

7: end for 

8: expr = (func, arg_l, arg_2, ...); 

9: end if 
10: return expr 

Notes: func.set is a function set, term_set is a terminal set, max_d is the 
maximum allowed depth for expressions, method is either full or grow, 
expr is the generated expression in prefix notation and rand ( ) is a function 
that returns random numbers uniformly distributed between 0 and 1. 

Algorithm 2.1: Pseudocode for recursive program generation with the 
full and grow methods. 



trees produced by growj^] Section [tT] (page |4p|) describes other initialisation 
mechanisms which address these issues. 

The initial population need not be entirely random. If something is 
known about likely properties of the desired solution, trees having these 
properties can be used to seed the initial population. This, too, will be 
described in Section l5Tl 



2.3 Selection 

As with most evolutionary algorithms, genetic operators in GP are applied 
to individuals that are probabilistically selected based on fitness. That is, 
better individuals are more likely to have more child programs than inferior 
individuals. The most commonly employed method for selecting individuals 
in GP is tournament selection, which is discussed below, followed by fitness- 
proportionate selection, but any standard evolutionary algorithm selection 
mechanism can be used. 

In tournament selection a number of individuals are chosen at random 



'’’While these are particular problems for the grow method, they illustrate a general 
issue where small (and often apparently inconsequential) changes such as the addition or 
removal of a few functions from the function set can in fact have significant implications 
for the GP system, and potentially introduce important but unintended biases. 



2.4 Recombination and Mutation 
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from the population. These are compared with each other and the best of 
them is chosen to be the parent. When doing crossover, two parents are 
needed and, so, two selection tournaments are made. Note that tourna- 
ment selection only looks at which program is better than another. It does 
not need to know how much better. This effectively automatically rescales 
fitness, so that the selection pressure^] on the population remains constant. 
Thus, a single extraordinarily good program cannot immediately swamp the 
next generation with its children; if it did, this would lead to a rapid loss 
of diversity with potentially disastrous consequences for a run. Conversely, 
tournament selection amplifies small differences in fitness to prefer the bet- 
ter program even if it is only marginally superior to the other individuals in 
a tournament. 

An element of noise is inherent in tournament selection due to the ran- 
dom selection of candidates for tournaments. So, while preferring the best, 
tournament selection does ensure that even average-quality programs have 
some chance of having children. Since tournament selection is easy to imple- 
ment and provides automatic fitness rescaling, it is commonly used in GP. 

Considering that selection has been described many times in the evolu- 
tionary algorithms literature, we will not provide details of the numerous 
other mechanisms that have been proposed. (Goldberg 1989), for example, 



describes fitness-proportionate selection, stochastic universal sampling and 
several others. 



2.4 Recombination and Mutation 

GP departs significantly from other evolutionary algorithms in the imple- 
mentation of the operators of crossover and mutation. The most commonly 
used form of crossover is subtree crossover. Given two parents, subtree 
crossover randomly (and independently) selects a crossover point (a node) 
in each parent tree. Then, it creates the offspring by replacing the subtree 
rooted at the crossover point in a copy of the first parent with a copy of 
the subtree rooted at the crossover point in the second parent, as illustrated 
in Figure |2.5| Copies are used to avoid disrupting the original individuals. 
This way, if selected multiple times, they can take part in the creation of 
multiple offspring programs. Note that it is also possible to define a version 
of crossover that returns two offspring, but this is not commonly used. 

Often crossover points are not selected with uniform probability. Typical 
GP primitive sets lead to trees with an average branching factor (the num- 
ber of children of each node) of at least two, so the majority of the nodes 
will be leaves. Consequently the uniform selection of crossover points leads 

4 A key property of any selection mechanism is selection pressure. A system with a 
strong selection pressure very highly favours the more fit individuals, while a system with 
a weak selection pressure isn’t so discriminating. 
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Parents ► Offspring 




Figure 2.5: Example of subtree crossover. Note that the trees on the left 
are actually copies of the parents. So, their genetic material can freely be 
used without altering the original individuals. 



to crossover operations frequently exchanging only very small amounts of 
genetic material (i.e., small subtrees); many crossovers may in fact reduce 
to simply swapping two leaves. To counter this, Koza (1992) suggested the 
widely used approach of choosing functions 90% of the time and leaves 10% 
of the time. Many other types of crossover and mutation of GP trees are 
possible. They will be described in Sections |5.2| and [573] pages |42] - |46l 

The most commonly used form of mutation in GP (which we will call 
subtree mutation ) randomly selects a mutation point in a tree and substi- 
tutes the subtree rooted there with a randomly generated subtree. This is 
illustrated in Figure 2.6 Subtree mutation is sometimes implemented as 



crossover between a program and a newly generated random program; this 
operation is also known as “headless chicken” crossover (Angeline 1997). 
Another common form of mutation is point mutation , which is GP’s 



rough equivalent of the bit-flip mutation used in genetic algorithms (Gold- 



berg 1989). In point mutation, a random node is selected and the primitive 



stored there is replaced with a different random primitive of the same arity 
taken from the primitive set. If no other primitives with that arity ex- 
ist, nothing happens to that node (but other nodes may still be mutated). 
When subtree mutation is applied, this involves the modification of exactly 
one subtree. Point mutation, on the other hand, is typically applied on a 
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Parents 



Offspring 



Mutation Mutation 




per-node basis. That is, each node is considered in turn and, with a certain 
probability, it is altered as explained above. This allows multiple nodes to 
be mutated independently in one application of point mutation. 

The choice of which of the operators described above should be used 
to create an offspring is probabilistic. Operators in GP are normally mu- 
tually exclusive (unlike other evolutionary algorithms where offspring are 
sometimes obtained via a composition of operators). Their probability of 
application are called operator rates. Typically, crossover is applied with the 
highest probability, the crossover rate often being 90% or higher. On the 
contrary, the mutation rate is much smaller, typically being in the region of 
1 %. 

When the rates of crossover and mutation add up to a value p which is 
less than 100%, an operator called reproduction is also used, with a rate of 
1 — p. Reproduction simply involves the selection of an individual based on 
fitness and the insertion of a copy of it in the next generation. 





Chapter 3 



Getting Ready to Run 
Genetic Programming 



To apply a GP system to a problem, several decisions need to be made; 
these are often termed the preparatory steps. The key choices are: 

1. What it the terminal setl 

2. What is the function setl 

3. What is the fitness measure ? 

4. What parameters will be used for controlling the run? 

5. What will be the termination criterion, and what will be designated 
the result of the runl 



3.1 Step 1: Terminal Set 

While it is common to describe GP as evolving programs, GP is not typ- 
ically used to evolve programs in the familiar Turing-complete languages 
humans normally use for software development. It is instead more com- 
mon to evolve programs (or expressions or formulae) in a more constrained 
and often domain-specific language. The first two preparatory steps, the 
definition of the terminal and function sets, specify such a language. That 
is, together they define the ingredients that are available to GP to create 
computer programs. 
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The terminal set may consist of: 

• the program ’s external inputs. These typically take the form of named 
variables (e.g., x, y). 

• functions with no arguments. These may be included because they 
return different values each time they are used, such as the function 
rand() which returns random numbers, or a function dist_to_wall() 
that returns the distance to an obstacle from a robot that GP is con- 
trolling. Another possible reason is because the function produces side 
effects. Functions with side effects do more than just return a value: 
they may change some global data structures, print or draw something 
on the screen, control the motors of a robot, etc. 

• constants. These can be pre-specified, randomly generated as part of 
the tree creation process, or created by mutation. 

Using a primitive such as rand can cause the behaviour of an individual 
program to vary every time it is called, even if it is given the same inputs. 
This is desirable in some applications. However, we more often want a 
set of fixed random constants that are generated as part of the process of 
initialising the population. This is typically accomplished by introducing 
a terminal that represents an ephemeral random constant. Every time this 
terminal is chosen in the construction of an initial tree (or a new subtree 
to use in an operation like mutation) , a different random value is generated 
which is then used for that particular terminal, and which will remain fixed 
for the rest of the run. The use of ephemeral random constants is typically 
denoted by including the symbol 3? in the terminal set; see Chapter [4] for an 
example. 



3.2 Step 2: Function Set 



The function set used in GP is typically driven by the nature of the problem 
domain. In a simple numeric problem, for example, the function set may 
consist of merely the arithmetic functions (+, -, *, /). However, all sorts 
of other functions and constructs typically encountered in computer pro- 



grams can be used. Table 3.1 shows a sample of some of the functions one 
sees in the GP literature. Sometimes the primitive set includes specialised 
functions and terminals which are designed to solve problems in a specific 
problem domain. For example, if the goal is to program a robot to mop the 
floor, then the function set might include such actions as move, turn, and 
swish-the-mop. 
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Table 3.1: Examples of primitives in GP function and terminal sets. 



Function Set 


Kind of Primitive 


Example(s) 


Arithmetic 

Mathematical 

Boolean 

Conditional 

Looping 


+, *, / 

sin, cos, exp 
AND, OR, NOT 
IF-THEN-ELSE 
FOR, REPEAT 



Terminal Set 


Kind of Primitive 


Example (s) 


Variables 
Constant values 
0-arity functions 


x, y 
3, 0.45 
rand, go_left 



3.2.1 Closure 



For GP to work effectively, most function sets are required to have an impor- 
tant property known as closure (Koza 1992), which can in turn be broken 
down into the properties of type consistency and evaluation safety. 

Type consistency is required because subtree crossover (as described in 
Section 2.4 1 can mix and join nodes arbitrarily. As a result it is necessary 
that any subtree can be used in any of the argument positions for every func- 
tion in the function set, because it is always possible that subtree crossover 
will generate that combination. It is thus common to require that all the 
functions be type consistent, i.e., they all return values of the same type, 
and that each of their arguments also have this type. For example +, -, *, 
and / can can be defined so that they each take two integer arguments and 
return an integer. Sometimes type consistency can be weakened somewhat 
by providing an automatic conversion mechanism between types. We can, 
for example, convert numbers to Booleans by treating all negative values as 
false, and non-negative values as true. However, conversion mechanisms can 
introduce unexpected biases into the search process, so they should be used 
with care. 



The type consistency requirement can seem quite limiting but often sim- 
ple restructuring of the functions can resolve apparent problems. For exam- 
ple, an if function is often defined as taking three arguments: the test, the 
value to return if the test evaluates to true and the value to return if the 
test evaluates to false. The first of these three arguments is clearly Boolean, 
which would suggest that if can’t be used with numeric functions like +. 
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This, however, can easily be worked around by providing a mechanism to 
convert a numeric value into a Boolean automatically as discussed above. 
Alternatively, one can replace the 3-input if with a function of four (nu- 
meric) arguments a, b, c, d. The 4-input if implements “If a < b then return 
value c otherwise return value d" . 

An alternative to requiring type consistency is to extend the GP sys- 
tem. Crossover and mutation might explicitly make use of type information 
so that the children they produce do not contain illegal type mismatches. 
When mutating a legal program, for example, mutation might be required 
to generate a subtree which returns the same type as the subtree it has just 
deleted. This is discussed further in Section 1(01 

The other component of closure is evaluation safety. Evaluation safety 
is required because many commonly used functions can fail at run time. An 
evolved expression might, for example, divide by 0, or call MOVE_FORWARD 
when facing a wall or precipice. This is typically dealt with by modifying 
the normal behaviour of primitives. It is common to use protected versions 
of numeric functions that can otherwise throw exceptions, such as division, 
logarithm, exponential and square root. The protected version of a function 
first tests for potential problems with its input (s) before executing the cor- 
responding instruction; if a problem is spotted then some default value is 
returned. Protected division (often notated with 7.) checks to see if its second 
argument is 0. If so, 7. typically returns the value 1 (regardless of the value 
of the first argument) Q Similarly, in a robotic application a MOVE_AHEAD 
instruction can be modified to do nothing if a forward move is illegal or if 
moving the robot might damage it. 

An alternative to protected functions is to trap run-time exceptions and 
strongly reduce the fitness of programs that generate such errors. However, 
if the likelihood of generating invalid expressions is very high, this can lead 
to too many individuals in the population having nearly the same (very 
poor) fitness. This makes it hard for selection to choose which individuals 
might make good parents. 

One type of run-time error that is more difficult to check for is numeric 
overflow. If the underlying implementation system throws some sort of ex- 
ception, then this can be handled either by protection or by penalising as 
discussed above. However, it is common for implementation languages to 
ignore integer overflow quietly and simply wrap around. If this is unaccept- 
able, then the GP implementation must include appropriate checks to catch 
and handle such overflows. 



'The decision to return the value 1 provides the GP system with a simple way to 
generate the constant 1, via an expression of the form (’/, x x). This combined with a 
similar mechanism for generating 0 via (- x x) ensures that GP can easily construct 
these two important constants. 
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3.2.2 Sufficiency 

There is one more property that primitives sets should have: sufficiency. 
Sufficiency means it is possible to express a solution to the problem at hand 
using the elements of the primitive setj^J Unfortunately, sufficiency can be 
guaranteed only for those problems where theory, or experience with other 
methods, tells us that a solution can be obtained by combining the elements 
of the primitive set. 

As an example of a sufficient primitive set consider {AND, OR, NOT, xl, x2, 
..., xN}. It is always sufficient for Boolean induction problems, since it can 
produce all Boolean functions of the variables xl, x2, ..., xN. An example 
of insufficient set is {+, -, *, /, x, 0, 1, 2}, which is unable to represent 
transcendental functions. The function exp(a;), for example, is transcenden- 
tal and therefore cannot be expressed as a rational function (basically, a 
ratio of polynomials), and so cannot be represented exactly by any combi- 
nation of {+, -, *, /, x, 0, 1, 2}. When a primitive set is insufficient, GP 
can only develop programs that approximate the desired one. However, in 
many cases such an approximation can be very close and good enough for 
the user’s purpose. Adding a few unnecessary primitives in an attempt to 
ensure sufficiency does not tend to slow down GP overmuch, although there 
are cases where it can bias the system in unexpected ways. 

3.2.3 Evolving Structures other than Programs 

There are many problems where solutions cannot be directly cast as com- 
puter programs. For example, in many design problems the solution is an 
artifact of some type: a bridge, a circuit, an antenna, a lens, etc. GP has 
been applied to problems of this kind by using a trick: the primitive set is set 
up so that the evolved programs construct solutions to the problem. This is 
analogous to the process by which an egg grows into a chicken. For example, 
if the goal is the automatic creation of an electronic controller for a plant, 
the function set might include common components such as integrator, 
differentiator, lead, lag, and gain, and the terminal set might contain 
reference, signal, and plant output. Each of these primitives, when 
executed, inserts the corresponding device into the controller being built. 
If, on the other hand, the goal is to synthesise analogue electrical circuits, 
the function set might include components such as transistors, capacitors, 
resistors, etc. See Section |6.3| for more information on developmental GP 
systems. 



2 More formally, the primitive set is sufficient if the set of all the possible recursive 
compositions of primitives includes at least one solution. 
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3.3 Step 3: Fitness Function 

The first two preparatory steps define the primitive set for GP, and therefore 
indirectly define the search space GP will explore. This includes all the 
programs that can be constructed by composing the primitives in all possible 
ways. However, at this stage, we still do not know which elements or regions 
of this search space are good. I.e., which regions of the search space include 
programs that solve, or approximately solve, the problem. This is the task 
of the fitness measure, which is our primary (and often sole) mechanism 
for giving a high-level statement of the problem’s requirements to the GP 
system. For example, suppose the goal is to get GP to synthesise an amplifier 
automatically. Then the fitness function is the mechanism which tells GP 
to synthesise a circuit that amplifies an incoming signal. (As opposed to 
evolving a circuit that suppresses the low frequencies of an incoming signal, 
or computes its square root, etc. etc.) 

Fitness can be measured in many ways. For example, in terms of: the 
amount of error between its output and the desired output; the amount 
of time (fuel, money, etc.) required to bring a system to a desired target 
state', the accuracy of the program in recognising patterns or classifying 
objects; the payoff that a game-playing program produces; the compliance 
of a structure with user-specified design criteria. 

There is something unusual about the fitness functions used in GP that 
differentiates them from those used in most other evolutionary algorithms. 
Because the structures being evolved in GP are computer programs, fitness 
evaluation normally requires executing all the programs in the population, 
typically multiple times. While one can compile the GP programs that make 
up the population, the overhead of building a compiler is usually substantial, 
so it is much more common to use an interpreter to evaluate the evolved 
programs. 

Interpreting a program tree means executing the nodes in the tree in 
an order that guarantees that nodes are not executed before the value of 
their arguments (if any) is known. This is usually done by traversing the 
tree recursively starting from the root node, and postponing the evaluation 
of each node until the values of its children (arguments) are known. Other 
orders, such as going from the leaves to the root, are possible. If none 
of the primitives have side effects, the two orders are equivalent^] This 
depth- first recursive process is illustrated in Figure [3T| Algorithm |3 . 1 1 gives 
a pseudocode implementation of the interpretation procedure. The code 
assumes that programs are represented as prefix-notation expressions and 
that such expressions can be treated as lists of components. 



3 Functional operations like addition don’t depend on the order in which their argu- 
ments are evaluated. The order of side- effecting operations such as moving or turning a 
robot, however, is obviously crucial. 
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Figure 3.1: Example interpretation of a syntax tree (the terminal x is a 
variable and has a value of -1). The number to the right of each internal 
node represents the result of evaluating the subtree root at that node. 



procedure: eval( expr ) 

1: if expr is a list then 

2: proc = expr(l) {Non-terminal: extract root} 

3: if proc is a function then 

4: value = proc( eval(expr(2)), eval(expr(3)), ... ) {Function: evaluate 

arguments} 

5: else 

6: value = proc( expr(2), expr(3), ...) {Macro: don’t evaluate argu- 

ments} 

7: end if 

8: else 

9: if expr is a variable or expr is a constant then 

10: value = expr {Terminal variable or constant: just read the value} 

11: else 

12: value = expr() {Terminal 0-arity function: execute} 

13: end if 

14: end if 

15: return value 

Notes : expr is an expression in prefix notation, expr(l) represents the prim- 
itive at the root of the expression, expr(2) represents the first argument of 
that primitive, expr(3) represents the second argument, etc. 

Algorithm 3.1: Interpreter for genetic programming 
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3 Getting Ready to Run Genetic Programming 



In some problems we are interested in the output produced by a program, 
namely the value returned when we evaluate the tree starting at the root 
node. In other problems we are interested in the actions performed by a 
program composed of functions with side effects. In either case the fitness 
of a program typically depends on the results produced by its execution on 
many different inputs or under a variety of different conditions. For example 
the program might be tested on all possible combinations of inputs xl, x2, 
..., xN. Alternatively, a robot control program might be tested with the 
robot in a number of starting locations. These different test cases typically 
contribute to the fitness value of a program incrementally, and for this reason 
are called fitness cases. 

Another common feature of GP fitness measures is that, for many prac- 
tical problems, they are multi- objective, i.e., they combine two or more dif- 
ferent elements that are often in competition with one another. The area of 
multi-objective optimisation is a complex and active area of research in GP 
and machine learning in general. See Chapter [9] and also (Deb, 20011. 



Deb 2001 



3.4 Step 4: GP Parameters 



The fourth preparatory step specifies the control parameters for the run. 
The most important control parameter is the population size. Other control 
parameters include the probabilities of performing the genetic operations, the 
maximum size for programs and other details of the run. 

It is impossible to make general recommendations for setting optimal 
parameter values, as these depend too much on the details of the application. 
However, genetic programming is in practice robust, and it is likely that 
many different parameter values will work. As a consequence, one need not 
typically spend a long time tuning GP for it to work adequately. 

It is common to create the initial population randomly using ramped 



half-and-half (Section 2.2 1 with a depth range of 2-6. The initial tree sizes 
will depend upon the number of the functions, the number of terminals 
and the arities of the functions. However, evolution will quickly move the 
population away from its initial distribution. 

Traditionally, 90% of children are created by subtree crossover. How- 
ever, the use of a 50-50 mixture of crossover and a variety of mutations (cf. 
Chapter [5| also appears to work well. 

In many cases, the main limitation on the population size is the time 
taken to evaluate the fitnesses, not the space required to store the individ- 
uals. As a rule one prefers to have the largest population size that your 
system can handle gracefully; normally, the population size should be at 
least 500, and people often use much larger populations [[] Often, to a first 



4 There are, however, GP systems that frequently use much smaller populations. These 
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approximation, GP runtime can be estimated by the product of: the number 
of runs R, the number of generations G, the size of the population P , the 
average size of the programs s and the number of fitness cases F. 

Typically, the number of generations is limited to between ten and fifty; 
the most productive search is usually performed in those early generations, 
and if a solution hasn’t been found then, it’s unlikely to be found in a 
reasonable amount of time. The folk wisdom on population size is to make 
it as large as possible, but there are those who suggest using many runs with 
much smaller populations instead. Some implementations do not require 
arbitrary limits of tree size. Even so, because of bloat (the uncontrolled 



growth of program sizes during GP runs; see Section 11.31, it is common to 
impose either a size or a depth limit or both (see Section 11.3.2| ). 

Sometimes the number of fitness cases is limited by the amount of train- 
ing dat£0 available. In this case, the fitness function should use all of it. 
(One does not necessarily need to use verification or holdout data, since 



over-fitting can be avoided by other means, as discussed in Section 13.12 



page 140 ) In other cases, e.g. 22-bit even parity, there can almost be too 
much training data. Then the fitness function may be reduced to use just a 
subset of the training data. This does not necessarily have to be done man- 
ually as there are a number of algorithms that dynamically change the test 
set as the GP runs. (These and other speedup techniques will be discussed 
in Chapter 13 particularly Section 10.1 page 83 ) 

It is common to record these details in a tableau, such as Table |4.1| on 
page [TT| 



3.5 Step 5: Termination and solution desig- 
nation 

The fifth preparatory step consists of specifying the termination criterion 
and the method of designating the result of the run. The termination cri- 
terion may include a maximum number of generations to be run as well as 
a problem-specific success predicate. Typically, the single best-so-far indi- 
vidual is then harvested and designated as the result of the run, although 
one might wish to return additional individuals and data as necessary or 
appropriate for the problem domain. 



typically rely more on mutation than crossover for their primary search mechanism. 

5 Training data refers to the test cases used to evaluate the fitness of the evolved 
individuals. 



Chapter 4 

Example 

Genetic Programming Run 



This chapter provides an illustrative run of GP in which the goal is to 
automatically create a program with a target input /output behaviour. In 
particular, we want to evolve an expression whose values match those of 
the quadratic polynomial x 2 + x + 1 in the range [—1, +1]. The process of 
mechanically creating a computer program that fit certain numerical data 
is sometimes called system identification or symbolic regression (see Sec- 
tion 12.2 for more). 

We begin with the five preparatory steps from the previous chapter and 
then describe in detail the events in one run. 



4.1 Preparatory Steps 

The purpose of the first two preparatory steps is to specify the ingredients 
the evolutionary process can use to construct potential solutions. Because 
the problem is to find a mathematical function of one independent variable, 
x, the terminal set (the inputs of the to-be-evolved programs) must include 
this variable. The terminal set also includes ephemeral random constants 
drawn from some reasonable range Q say from —5.0 to +5.0, as described in 



^^What is a “reasonable” range is likely to be extremely problem dependent. While in 
theory you can build up large constants using small constants and arithmetic operators, 
the performance of your system is likely to improve considerably if you provide constants 
of roughly the right magnitude from the beginning. Your choice of genetic operators can 
also be important here. If you’re finding that your system is struggling to evolve the 
right constants, it may be helpful to introduce mutation operators specifically designed 
to search of the space of constants. 
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4 Example Genetic Programming Run 



Section |TT7T] Thus the terminal set, T, is 



T = {2,3?}. 

The statement of the problem does not specify which functions may be 
employed in the to-be-evolved program. One simple choice for the function 
set is the four ordinary arithmetic functions: addition, subtraction, mul- 
tiplication and division. Most numeric regression problems will require at 
least these operations, sometimes with additional functions such as sin() and 
log(). We will use the simple function set 

F = {+,-, *,%}, 



where "/, is protected division as discussed in Section 3.2.1 Note that the 



target polynomial can be expressed exactly using the terminal and function 



sets we have chosen, so these primitives are sufficient (cf. page 23) for the 
quadratic polynomial problem. 

The third preparatory step involves constructing the fitness measure that 
specifies what the user wants. The high-level goal of this problem is to find 
a program whose output is equal to the values of the quadratic polynomial 
:r 2 +:r+l. Therefore, the fitness assigned to a particular individual in the 
population must reflect how closely the output of an individual program 
comes to the target polynomial x 2 + x + 1. 

In principle, the fitness measure could be defined in terms of the mathe- 
matical integral of the difference between the evolved function and the target 
function. However, for most symbolic regression problems, it is not practical 
or even possible to compute the value of the integral analytically. Thus, it 
is common to define the fitness to be the sum of absolute errors measured 
at different values of the independent variable x in the range [—1.0, +1.0]. 
In particular, we will measure the errors for x £ {—1.0, —0.9, • • • , 0.9, 1.0}. 
A smaller value of fitness (error) is better; a fitness (error) of zero would 
indicate a perfect fit. With this definition, our fitness is (approximately) 
proportional to the area between the parabola x 2 + x + 1 and the curve 



representing the candidate individual (see Figure 4.2 for examples). 

The fourth step is where we set our run parameters. The population size 
in this small illustrative example will be just four. The population size for 
a run of GP typically consists of thousands of individuals, but we will use 
this tiny population size to keep the example manageable. The crossover 
operation is commonly used to generate about 90% of the individuals in 
the population; the reproduction operation (where a fit individual is sim- 
ply copied from one generation to the next) is used to generate about 8% 
of the population; the mutation operation is used to generate about 1% of 



the population; and the architecture-altering operations (see Section 6.1.21 
are used to generate perhaps 1% of the population. However, because this 
example involves an abnormally small population of only four individuals, 
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Table 4.1: Parameters for example genetic programming run 



Objective: 

Function set: 
Terminal set: 
Fitness: 
Selection: 
Initial pop: 

Parameters: 

Termination: 



Find program whose output matches x 2 + x + 1 over the 
range — 1 < x < +1. 

+, — , 7. (protected division), and x; all operating on floats 
x , and constants chosen randomly between —5 and +5 
sum of absolute errors for x G {—1.0, —0.9, . . . 0.9, 1.0} 
fitness proportionate (roulette wheel) non elitist 
ramped half-and-half (depth 1 to 2. 50% of terminals are 
constants) 

population size 4, 50% subtree crossover, 25% reproduction, 
25% subtree mutation, no tree size limits 
Individual with fitness better than 0.1 found 



the crossover operation will be used twice (each time generating one indi- 
vidual), which corresponds to a crossover rate of 50%, while the mutation 
and reproduction operations will each be used to generate one individual. 
These are therefore applied with a rate of 25% each. For simplicity, the 
architecture-altering operations are not used for this problem. 

In the fifth and final step we need to specify a termination condition. A 
reasonable termination criterion for this problem is that the run will continue 
from generation to generation until the fitness (or error) of some individual 
is less than 0.1. In this contrived example, our example run will (atypically) 
yield an algebraically perfect solution with a fitness of zero after just one 
generation. 



4.2 Step-by-Step Sample Run 



Now that we have performed the five preparatory steps, the run of GP can 



be launched. The GP setup is summarised in Table 4.1 



4.2.1 Initialisation 



GP starts by randomly creating a population of four individual computer 
programs. The four programs are shown in Figure |4T] in the form of trees. 

The first randomly constructed program tree (Figure [4. l} i) is equivalent 
to the expression x + 1. The second program (Figure [47Tb ) adds the constant 
terminal 1 to the result of multiplying a: by a: and is equivalent to a; 2 +l. The 
third program (Figure 4.1 :) adds the constant terminal 2 to the constant 
terminal 0 and is equivalent to the constant value 2. The fourth program 
(Figure 4.11) is equivalent to x. 
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4 Example Genetic Programming Run 



(a) 



(b) 



(c) 



(d) 




Figure 4.1: Initial population of four randomly created individuals of gen- 
eration 0. 



4.2.2 Fitness Evaluation 



Randomly created computer programs will typically be very poor at solving 
any problem. However, even in a population of randomly created programs, 
some programs are better than others. The four random individuals from 
generation 0 in Figure |4~T[ produce outputs that deviate by different amounts 
from the target function x 2 + x + 1. Figure 4.2 compares the plots of each of 
the four individuals in Figure |4~l[ and the target quadratic function x 2 +x+l. 
The sum of abs olute errors for the straight line x+1 (the first individual) is 
7.7 (Figure 4.2 1 ). The sum of absolute errors for the parabola x 2 +l (the 
second individual) is 11.0 (Figure [ 4 ( 2 ) 3 ). The sums of the absolute errors for 
the remaining two individuals are 17.98 (Figure fO) :) and 28.7 (Figure |4~2] f). 

As can be seen in Figure |4~2| the straight line x+1 (Figure |4~2) i) is closer 
to the parabola x 2 + x + 1 in the range from -1 to +1 than any of three other 
programs in the population. This straight line is, of course, not equivalent to 
the parabola x 2 + x + 1; it is not even a quadratic function. It is merely the 
best candidate that happened to emerge from the blind (and very limited) 
random search of generation 0. 

In the valley of the blind, 

the one-eyed man is king. 



4.2.3 Selection, Crossover and Mutation 

After the fitness of each individual in the population is found, GP then 
probabilistically selects the fitter programs from the population to act as 
the parents of the next generation. The genetic operations are applied to 
the selected individuals to create offspring programs. The important point 
is that our selection process is not greedy. Individuals that are known to be 
inferior still have some chance of being selected. The best individual in the 
population is not guaranteed to be selected and the worst individual in the 
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Figure 4.2: Graphs of the evolved functions from generation 0. The solid 
line in each plot is the target function x 2 + x + 1, with the dashed line 
being the evolved functions from the first generation (see Figure 4.1 1 . The 
fitness of each of the four randomly created individuals of generation 0 is 
approximately proportional to the area between two curves, with the actual 
fitness values being 7.7, 11.0, 17.98 and 28.7 for individuals (a) through (d), 
respectively. 



population will not necessarily be excluded. 

In this example we will start with the reproduction operation. Because 
the first individual (Figure [4.1^ ) is the most fit individual in the population, 
it is very likely to be selected to participate in a genetic operation. Let us 
suppose that this particular individual is, in fact, selected for reproduction. 
If so, it is copied, without alteration, into the next generation (generation 1). 
It is shown in Figure 4.3 1 as part of the population of the new generation. 

We next perform the mutation operation. Because selection is proba- 
bilistic, it is possible that the third best individual in the population (Fig- 
ure 4.1 :) is selected. One of the three nodes of this individual is then ran- 
domly picked as the site for the mutation. In this example, the constant 
terminal 2 is picked as the mutation site. This program is then randomly 
mutated by deleting the entire subtree rooted at the picked point (in this 
case, just the constant terminal 2) and inserting a subtree that is randomly 
constructed in the same way that the individuals of the initial random pop- 
ulation were originally created. In this particular instance, the randomly 
grown subtree computes x divided by x using the protected division oper- 
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4 Example Genetic Programming Run 



(a) 



(b) 



(c) 



(d) 




Figure 4.3: Population of generation 1 (after one reproduction, one muta- 
tion, and two one-offspring crossover operations). 



ation 1. The resulting individual is shown in Figure |1.3| >. This particular 
mutation changes the original individual from one having a constant value 
of 2 into one having a constant value of 1, improving its fitness from 17.98 
to 11.0. 



Finally, we use the crossover operation to generate our final two indi- 
viduals for the next generation. Because the first and second individuals in 
generation 0 are both relatively fit, they are likely to be selected to partic- 
ipate in crossover. However, selection can always pick suboptimal individ- 
uals. So, let us assume that in our first application of crossover the pair of 
selected parents is composed of the above-average tree in Figures |4.1^ and 
the below-average tree in Figure |4H~[ h One point of the first parent, namely 
the + function in Figure |4.1fr ,, is randomly picked as the crossover point for 
the first parent. One point of the second parent, namely the leftmost termi- 
nal x in Figure |4H~} 1, is randomly picked as the crossover point for the second 
parent. The crossover operation is then performed on the two parents. The 
offspring (Figure [4. 3|:) is equivalent to x and is not particularly noteworthy. 



Let us now assume, that in our second application of crossover, selection 
chooses the two most fit individuals as parents: the individual in Figure |47TJd 
as the first parent, and the individual in Figure 4.1 r as the second. Let us 
further imagine that crossover picks the leftmost terminal x in Figure |1 . 1 [ > 
as a crossover point for the first parent, and the + function in Figure |4Tkas 
the crossover point for the second parent. Now the offspring (Figure |473} l) 
is equivalent to x 2 + x + 1 and has a fitness (sum of absolute errors) of zero. 
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4.2.4 Termination and Solution Designation 

Because the fitness of the individual in Figure pH}} ] is below 0.1, the termina- 
tion criterion for the run is satisfied and the run is automatically terminated. 
This best-so-far individual (Figure |4~3} i) is then designated as the result of 
the run. 

Note that the best-of-run individual (Figure |4.3p l) inco rpora tes a good 
trait (the quadratic term x 2 ) from the first parent (Figure 4.1a) with two 
other good traits (the linear term x and constant term of 1) from the second 
parent (Figure pj~T} t) . The crossover operation thus produced a solution to 
this problem by recombining good traits from these two relatively fit parents 
into a superior (indeed, perfect) offspring. 

This is, obviously, a highly simplified example, and the dynamics of a 
real GP run are typically far more complex than what is presented here. 
Also, in general, there is no guarantee that an exact solution like this will 
be found by GP. 



Part II 



Advanced Genetic 
P rogramming 




In which a search is organdized . . . 
and Piglet encounters the Heffalump of Bloat. 
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Chapter 5 

Alternative Initialisations 
and Operators in 
Tree-based GP 



The genetic programming system described in the preceding chapters is just 
the beginning; in many ways it is the simplest thing that could possibly 
work. Most of the techniques described in Part |T] date back to the late 
1980’s and early 1990’s, a wide array of alternatives and extensions have 
been explored since. A full catalogue of these would be far beyond the 
scope of this book. The chapters in Part [II] survey a number of the more 
prominent or historically important extensions to GP, particularly (but not 
exclusively) in relation to the tree-based representation for programs. 



We start, in this chapter, by reviewing a variety of initialisation strategies 
(Section |5T| ) and genetic operators (Sections |5.2| and 5.3 1 for tree-based GP 
not covered in Part |T] We also briefly look at some hybridisations of GP 
with other techniques (Section |5.4|). 



5.1 Constructing the Initial Population 



Koza’s ramped half-and-half method is the most common way of creating the 



initial GP population (cf. Section 2.2 page 11). However, there are several 



other ways of constructing a collection of random trees. In Section |5.1.2| 
we will briefly consider an unexpected impact of population initialisation. 
There has also been some work with non-random or informed starting points 



(cf. Section 5.1.3). 
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5 Alternative Initialisations and Operators in Tree-based GP 



5.1.1 Uniform Initialisation 



The shape of the initial trees can be lost within a few generations (more on 
this below) . However, a good start given by the initial population can still be 
crucial to the success of a GP run. In general, there are an infinite number of 
possible computer programs. This means that it is impossible to search them 
uniformly. Therefore, any method used to create the initial population will 
have a bias. For example, ramped half-and-half tends to create bushy trees. 
Such trees have a higher proportion of solutions to symmetric problems, 
such as parity. Conversely, the smallest solution to the Sante Fe ant trail- 



following problem is more randomly shaped (Langdon and Poli 1998a). This 



is partly why ramped half-and-half is very poor at finding programs which 
can navigate the Sante Fe trail. Another reason is that many of the programs 
generated by ramped half-and-half (with standard parameters) are simply 



too small. Chellapilla (1997a) claims good results when the size of the initial 



trees was more tightly controlled. 



Iba (1996a) and Bohm and Geyer-Schulz (1996) report methods to pre- 



cisely sample trees uniformly based on Alonso’s bijective algorithm (Alonso 



and Schott 1995). Although this algorithm has been criticised (Luke 



2000) for being computationally expensive, it can be readily used in prac- 



tice. Langdon (2000) introduced the ramped uniform initialisation which 



extends Alonso’s bijective algorithm by allowing the user to specify a range 
of initial tree sizes. It then generates equal numbers of random trees 
for each length in the chosen range. (C++ code can be obtained from 
ftp : //cs . ucl . ac .uk/genetic/gp-code/rand_tree . cc.) 

With these more “uniform” initialisations, most trees are asymmetric 
with some leaves very close to the root of the tree. This is quite different 
from the trees generated by ramped half-and-half which are on average some 
distance from the root. Uniform sampling may be better in problems where 
the desired solutions are asymmetric with some leaves being much more 
important than others. For example, in data mining it is common to look 
for solutions with a few dominant variables (which may be close to the root 
node) whilst other variables are of little or no interest and may be some 
distance from the root (or indeed not present in the tree). On the other 
hand, problems like multiplexer or parity require all the inputs to be used 
and are of similar importance. Bushier trees may be better at solving such 
problems. 



5.1.2 Initialisation may Affect Bloat 

Crossover has a strong preference for creating a very non-uniform distri- 



butions of tree sizes (Poli, Langdon, and Dignum 2007). Crossover gener- 



ates very short programs much more often than longer ones. Selection can 
only partially combat this tendency. Typically, crossover will totally rear- 
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range the size and shape of the initial trees within a few generations. As 
discussed in Section 11.3.1 (page 101 1, the excessive sampling of short pro- 
grams appears to be an important cause of bloat (the uncontrolled growth 
of programs during GP runs, which will be described in more detail in Sec- 
tion 11.3 page ToT] onwards) . It has been shown (Dignum and Poli 2007) 
that when the initial population is created with the size distribution pre- 



ferred by crossover (see Section 11.3.11, bloat is more marked. The distri- 



bution has a known mathematical formula (it is a Lagrange distribution of 
the second kind), but in practice it can be created by simply performing 
multiple rounds of crossover on a population created in the traditional way 
before the GP run starts. This is known as Lagrange initialisation. These 
findings suggest that initialisation methods which tend to produce many 
short programs may in fact induce bloat sooner than methods that produce 
distributions more skewed towards larger programs. 



5.1.3 Seeding 

The most common way of starting a GP run from an informed non-random 
point is seeding the initial population with an individual which, albeit not 
a solution, is thought to be a good starting point. Such a seed may have 
been produced by an earlier GP run or perhaps constructed by the user 



(Aler, Borrajo, and Isasi |2002 Holmes) 1995| Hsu and Gustafson 2001 


Langdoi 


r and Nordin [ 2000 ; Langdon and Treleaven 1997 ; | Westerberg anc 


Levine 


20011. However, Marek, Smart, and Martin 


(2002) reported that 



hand written programs may not be robust enough to prosper in an evolving 
population. 

One point to be careful of is that such a seed individual is liable to be 
much better than randomly created trees. Thus, its descendants may take 
over the population within a few generations. So, under evolution the seeded 
population is initially liable to lose diversity rapidly. Furthermore, depend- 
ing upon the details of the selection scheme used, a single seed individual 
may have some chance of being removed from the population. Both problems 
are normally dealt with by filling the whole population with either identical 
or mutated copies of the seed. This method creates a low diversity initial 
population in a controlled way, thereby avoiding the initial uncontrolled loss 
of diversity associated with single seeds. Furthermore, with many copies of 
the seed, few selection methods will have much chance of removing all copies 
of the seed before they are able to create children. Diversity preserving tech- 



niques, such as multi-objective GP (e.g., (Parrott, Li, and Ciesielski 2005), 
(Setzkorn 20051 and Chapter [9|), demes (Langdon 1998|) (see Section 10.31, 



fitness sharing (Goldberg 1989) and the use of multiple seed trees, might 
also be good cures for the problems associated with the use of a single seed. 
In any case, the diversity of the population should be monitored to ensure 
that there is significant mixing of different initial trees. 
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5.2 GP Mutation 



5.2.1 Is Mutation Necessary? 

Mutation was used in early experiments in the evolution of programs, e.g., 



in 


Bickel and Bickel 


1987 


Cramer 


1985 


Fujiki and Dickinson 


1987). It 


was not, however, used in ( 


Koza 


1992) and 


Koza 


19941, as Koza wished to 



demonstrate that mutation was not necessary and that GP was not perform- 
ing a simple random search. This has significantly influenced the field, and 
mutation is often omitted from GP runs. While mutation is not necessary 



for GP to solve many problems, O’Reilly (1995) argued that mutation — in 



combination with simulated annealing or stochastic iterated hill climbing 
can perform as well as crossover-based GP in some cases. Nowadays, mu- 
tation is widely used in GP, especially in modelling applications. Koza also 
advises to use of a low level of mutation; see, for example, (|Koza, Bennett, 



Andre, and Keane 1996b). 



Comparisons of crossover and mutation suggest that including mutation 



can be advantageous. Chellapilla (1997b) found that a combination of six 



mutation operators performed better than previously published GP work on 



four simple problems. Harries and Smith (1997) also found that mutation 



based hill climbers outperformed crossover-based GP systems on similar 



problems. Luke and Spector (1997) suggested that the situation is complex, 
and that the relative performance of crossover and mutation depends on 
both the problem and the details of the GP system. 



5.2.2 Mutation Cookbook 

With linear bit string GAs, mutation usually consists of random changes in 
bit values. In contrast, in GP there are many mutation operators in use. 
Often multiple types of mutation are beneficially used simultaneously (e.g., 
see (Kraft, Petry, Buckles, and Sadasivan 19941 and (Angeline 1996)). We 
describe a selection of mutation operators below: 

Subtree mutation replaces a randomly selected subtree with another ran- 



domly created subtree ( Koza 1992 page 106). Kinnear (1993) defined 



a similar mutation operator, but with a restriction that prevents the 
offspring from being more than 15% deeper than its parent. 



Size-fair subtree mutation was proposed in two forms by |Langdon 
(19981. In both cases, the new random subtree is, on average, the 



same size as the code it replaces. The size of the random code is given 
either by the size of another random subtree in the program or chosen 
at random in the range [1/2,31/2] (where l is the size of the subtree 
being replaced). The first of these methods samples uniformly in the 
space of possible programs, whereas the second samples uniformly in 
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the space of program lengths. Experiments suggested that there was 
far more bloat (cf. Section 11.3.1 page 101 1 with the first mutation 
operator. 

Node replacement mutation (also known as point mutation) is similar 
to bit string mutation in that it randomly changes a point in the 
individual. In linear GAs the change would be a bit flip. In GP, 
instead, a node in the tree is randomly selected and randomly changed. 
To ensure the tree remains legal, the replacement node has the same 



number of arguments as the node it is replacing, e.g. (McKay, Willis, 



and Barton 1995 page 488). 



Hoist mutation creates a new offspring individual which is copy of a ran- 
domly chosen subtree of the parent. Thus, the offspring will be smaller 
than the parent and will have a different root node (Kinnear 1994a). 



Shrink mutation replaces a randomly chosen subtree with a randomly 
created terminal (Angeline 1996). This is a special case of subtree 
mutation where the replacement tree is a terminal. As with hoist 
mutation, it is motivated by the desire to reduce program size. 

Permutation mutation selects a random function node in a tree and then 



randomly permuting its arguments (subtrees). Koza (1992) used per- 



mutation in one experiment [page 600] where it was shown to have 
little effect. In contrast, Maxwell (1996) had more success with a mu- 



tation operator called swap , which is simply a permutation mutation 
restricted to binary non-commutative functions. 

Mutating constants at random Schoenauer, Sebag, Jouve, Lamy, and 



Maitournam (1996) mutated constants by adding random noise from 



a Gaussian distribution. Each change to a constant was considered a 
separate mutation. 

Mutating constants systematically A variety of potentially expensive 
optimisation tools have been applied to try and fine-tune an existing 
program by finding the “best” value for the constants within it. Indeed 



STROGANOFF (Iba, Sato, and de Garis 1995b Nikolaev and Iba 



2006) optimises each tree modified by crossover. Clever mechanisms 



are employed to minimise the computation required. 



(McKay et al. . 1995 page 489) is more in keeping with traditional GP 



and uses a mutation operator that operates on terminals, replacing in- 
put variables by constants and vice versa. In this approach “whenever 
a new constant is introduced [. . . ] a non-linear least squares optimisa- 
tion is performed to obtain the ‘best’ value of the constant(s)”. Schoe- 



nauer, Lamy, and Jouve ( 1995 ) also used a mutation operator that 
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affects all constants in an individual where “a numerical partial gra- 
dient ascent is achieved to reach the nearest local optimum” . Finally, 



Sharman, Esparcia Alcazar, and Li ( 1995 ) used simulated annealing to 



update numerical values (which represented signal amplification gains) 
within individuals. 



5.3 GP Crossover 



During biological sexual reproduction, the genetic material from both 
mother and father is combined in such a way that genes in the child are 
in approximately the same position as they were in its parents. This is quite 
different from traditional tree-based GP crossover, which can move a subtree 
to a totally different position in the tree structure. 

Crossover operators that tend to preserve the position of genetic ma- 
terial are called homologous, and several notions of homologous crossover 
have been proposed for GP. It is fairly straightforward to realise homolo- 
gous crossover when using linear representations, and homologous operators 



are widely used in linear GP (cf. Figure 7.4 page 


65 1 (Defoin Platel, Clergue, 


and Collard 2003; Francone, Conrads, Banzhaf, and Nordin 1999t|Hansen 


2003[ Hansen, Lowry, Meservy, and McDonald 


2007 ; Nordin, Banzhaf, and 


Francone 


1999 O’Neill, Ryan, Keijzer, and Cattolico), 2003). Various forms 


of homologous crossover have also been proposed for tree-based GP (Col- 


let 2007 


Langdon| |2000 Lones 2003| |MacCallum| 2003 Yamamoto and 


Tschudin 


20051. 



The oldest homologous crossover in tree-based GP is one-point crossover 
( Langdon and Poli 2002 Poli and Langdon 1997 1998a I . This works by se- 



lecting a common crossover point in the parent programs and then swapping 
the corresponding subtrees. To allow for the two parents having different 
shapes, one-point crossover analyses the two trees from the root nodes and 
selects the crossover point only from the parts of the two trees in the common 



region (see Figure 5.1 1 . In the common region, the parents have the same 
shapeQ The common region is related to homology, in the sense that the 
common region represents the result of a matching process between parent 
trees. Within the common region between two parent trees, the transfer of 
homologous primitives can happen like it does in a linear bit string genetic 
algorithm. 



Uniform crossover for trees (Poli and Langdon 1998b) works (in the 
common region) like uniform crossover in GAs. That is, the offspring are 
created by visiting the nodes in the common region and flipping a coin at 



1 Nodes in the common region need not be identical but they must have the same 
arity. That is, they must both be leaves or both be functions with the same number of 
inputs. 
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Parent 1 Parent 2 




Offspring 1 Offspring 2 




Alignment 




Selection of 
Common 
Crossover Point 




Figure 5.1: Example of one-point crossover between parents of different 
sizes and shapes. 



each locus to decide whether the corresponding offspring node should be 
picked from the first or the second parent. If a node to be inherited belongs 
to the base of the common region and is a function, then the subtree rooted 
there is inherited as well. With this form of crossover, there can be a greater 
mixing of the code near the root than with other operators. 



In context-preserving crossover ( 


D’haeseleer 


1994), the crossover points 


are constrained to have the same coordinates, 


.ike in one-point crossover. 



Note that the crossover points are not limited to the common region. 



In size-fair crossover ( 


Langdon 


1999a 


2000) the first crossover point is 


selected randomly, as with standarc 


crossover. Then the size of the subtree 



to be removed from the first parent is calculated. This is used to constrain 
the choice of the second crossover point so as to guarantee that the subtree 
excised from the second parent will not be “unfairly” big. 



Harries and Smith (1997) suggested five new crossover operators that 



are like standard crossover but with probabilistic restrictions on the depth 
of crossover points within the parent trees. 



Since crossover and mutation are specific to the representation used in 
GP, each new representation tends to need new crossover and mutation 
For example “ripple crossover 



operators. 



(O’Neill et al. 



of looking at crossover in grammatical evolution (Section 



2003) 



6.2.3 



is a way 
page 



551. 
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As we shall see in Chapter [7J specific crossover operators exist for lin- 
ear GP (Section and graph based GP systems (Sectio n |7.2 [ ), such as 
PDGP (page [65]) , PADO (page [67]) and Cartesian GP (page [67|b 



5.4 Other Techniques 



GP can be hybridised with other techniques. For example, Iba, de Garis, and| 
Sato (19941, Nikolaev and Iba (20061, and Zhang and Miihlenbein (19951 



have incorporated information theoretic and minimum description length 
ideas into GP fitness functions to provide a degree of regularisation and 



so avoid over-fitting (and bloat, see Section 11.3 1. As mentioned in Sec- 
tion [67273] computer language grammars can be incorporated into GP. 

Whereas genetic programming typically uses an evolutionary algorithm 
to search the space of computer programs, various other heuristic search 
methods can also be applied to program search, including: enumeration 
(Olsson |1995| ), hill climbing (O’Reilly and Oppacher 1994a I, and simu- 
lated annealing ( |Q’Re illy 1996 Tsoulos and Lagaris 2006). As discussed 
in Chapter [8j it is also possible to extend Estimation of Distribution Algo- 
rithms (EDAs) to the variable size representations used in GP. 

Another alternative is to use co- evolution with multiple populations, 
where the fitness of individuals in one population depends on the behaviour 
of individuals in other populations. There have been many successful appli- 



cations of co-evolution in GP, including ( Azaria and Sipper, 2005a Brameier, 



Haan, Krings, and MacCallum 2006 Buason, Bergfeldt, anc 


Ziemke 2005 


Channon 2006 Dolinsky, Jenkinson, and Colquhoun 2007 


Funes, Sklar, 


Juille, and Pollack 


1998a Gagne and Parizeau, 2007 Hillis 


1992| Hornby 


and Pollack, 2001| Mendes, de B. Voznika, Nievola, and Freitas[ 2001| |Pi- 


aseezny, Suzuki, and Sawai( 2004; Schmidt and Lipson 2006 


Slrarabi and 


Sipper 2006 


Soule 


2003 Soule and Komireddy; 2006; Spector 


2002; Spec- 


tor and Klein 


2006 


Spector, Klein, Perry, and Feinstein 2005b 


Wilson and 



Heywood 2007 Zhang and Clro 1999). 



Finally, it is worth mentioning that program trees can be manipulated 
with editing operations (Koza 1992). For example, if the root node of 



a subtree is x but one of its arguments is always guaranteed to evaluate 
to 0, then we can replace the subtree rooted there with the terminal 0. 
If the root node of a subtree is + and one argument evaluates to 0, we 
can replace the subtree with the other argument of the +. Editing can 
reduce the complexity of evolved solutions and can make them easier to 
understand. However, it may also lead to GP getting stuck in local optima, 
so editing operations should probably be used sparingly at run time. Other 
reorganisation operations of various types are also possible. For example, 



after trees are generated by GP, ( Garcia- Almanza and Tsang 2006 2007) 



prune branches and combine branches from different trees. 



Chapter 6 

Modular, Grammatical 
and Developmental 
Tree-based GP 



This chapter discusses advanced techniques that are primarily focused on 
two important issues in genetic programming: modularity and constraint. 
In Section 6.1 we explore the evolution of modular, hierarchical structures, 
and in Section 6.2 we looks at ways of constraining the evolutionary process, 
typically based on some sort of domain knowledge. We also look at using 
GP to evolve programs which themselves develop solutions (Section 6.3 1 or 
even construct other programs (Section |6.4[). 



6.1 Evolving Modular and 
Hierarchical Structures 



The construction of any highly complex object or individual, whether an oak 
tree or an airliner, typically uses hierarchical, modular structures to manage 
and organise that complexity. Animals develop in a highly regular way 
that yields a hierarchical structure of components ranging from systems and 
organs down to cells and organelles. GP, as described so far, is typically used 
to evolve expressions that, while being suitable solutions to many problems, 
rarely exhibit any large-scale modular structure. 

Given the pervasiveness of hierarchical structure as an organisational tool 
in both biology and engineering, it seems likely that such modular structure 
could be valuable in genetic programming as well. Consequently, this has 
been a subject of study from the early days of genetic programming. For 



example, Angeline and Pollack (1992) created dynamic libraries of subtrees 
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taken from parts of fit GP trees. Special mutation operations allowed the 
GP population to share code by referring to the same code within the li- 
brary. Subsequently, Angeline suggested that the scheme’s advantages lay 
in allowing GP individuals to access far more code than they actually “held” 
within themselves, rather than principally in developing more modular code. 



Rosea and Ballard ( 1996a) used a similar scheme, but were able to use much 



more information from the fitness function to guide the selection of the code 
to be inserted into the library and its subsequent use by members of the GP 
population. Olsson ( 1999 1995 1 later developed an abstraction operator for 
use in his AD ATE system, where sub- functions (anonymous lambda expres- 
sions) were automatically extracted. Unlike Angeline’s library approach, 
Olsson’s modules remained attached to the individual they were extracted 
from. 

Koza’s automatically defined functions (ADFs) ( Koza[ 1994| ) remain the 
most widely used method of evolving reusable components and have been 
used successfully in a variety of settings. Basic ADFs (covered in Sec- 
tion 6.1.1 1 use a fixed architecture specified in advance by the user. Koza 
later extended this using architecture altering operations (Section 6.1.21, 
which allow the architecture to evolve along with the programs. 



6.1.1 Automatically Defined Functions 



Human programmers organise sequences of repeated steps into reusable com- 
ponents such as subroutines, functions and classes. They then repeatedly 
invoke these components, typically with different inputs. Reuse eliminates 
the need to “reinvent the wheel” every time a particular sequence of steps 
is needed. Reuse also makes it possible to exploit a problem’s modularities, 
symmetries and regularities (thereby potentially accelerate the problem- 
solving process). This can be taken further, as programmers typically or- 
ganise these components into hierarchies in which top level components call 
lower level ones, which call still lower levels, etc. Koza’s ADFs provide a 
mechanism by which the evolutionary process can evolve these kinds of po- 
tentially reusable components. We will review the basic concepts here, but 



ADFs are discussed in great detail in (Koza 1994). 

When ADFs are used, a program consists of multiple components. These 
typically consist of one or more function-defining branches (i.e., ADFs), as 
well as one or more main result-producing branches (the RPB ), as illustrated 
in the example in Figure |6.1| The RPB is the “main” program that is 
executed when the individual is evaluated. It can, however, call the ADFs, 
which can in turn potentially call each other. A single ADF may be called 
multiple times by the same RPB, or by a combination of the RPB and other 
ADFs, allowing the logic that evolution has assembled in that ADF to be 
re-used in different contexts. 
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Figure 6.1: Example of program structure with two automatically-defined 
functions (ADF1 and ADF2) and one result-producing branch (RPB). 



Consider, for example, the following individual consisting of a result- 
producing branch and a single ADF : 



RPB : ADF(ADF(ADF(x))) 
ADF : argO x argO 



( 6 . 1 ) 

( 6 . 2 ) 



The ADF (Equation |6.2| ) is simply the squaring function, but by combining 
this multiple times in the RPB (Equation 6.1 1 this individual computes x 8 
in a highly compact fashion. 

It is important to not be fooled by a tidy example like this. ADFs 
evolved in real applications are typically complex and can be very difficult 
to understand. Further, simply including ADFs provides no guarantee of 
modular re-use. As is discussed in Chapter [l3j there are no silver bullets. 
It may be that the RPB never calls an ADF or only calls it once. It is also 
common for an ADF to not actually encapsulate any significant logic. For 
example, an ADF might be as simple as a single terminal, in which case it 
is essentially just providing a new name for that terminal. 

In Koza’s approach, each ADF is attached (as a branch) to a specific indi- 
vidual in the population. This is in contrast to both Angeline’s and Rosca’s 
systems mentioned above, both of which have general pools of modules or 
components which are shared across the population. Sometimes recursion 
is allowed in ADFs, but this frequently leads to infinite computations. Typ- 
ically, recursion is prevented by imposing an order on the ADFs within an 
individual and by restricting calls so that ADFj can only call ADF_,- if i < j. 

In the presence of ADFs, recombination operators are typically con- 
strained to respect the larger structure. That is, during crossover, a subtree 
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from ADF; can only be swapped with a subtree from another individual’s 

ADF ?; . 

The program’s result-producing branch and its ADFs typically have dif- 
ferent function and terminal sets. For example, the terminal set for ADFs 
usually include arguments, such as argO, argl. Typically the user must 
decide in advance the primitive sets, the number of ADFs and any call re- 
strictions to prevent recursion. However, these choices can be evolved using 
the architecture-altering operations described in Section 

Koza also proposed other types of automatically evolved program corn- 

1999). Automatically defined 



ponents (Koza, Andre, Bennet, and Keane 



iterations (ADIs), automatically defined loops (ADLs) and automatically 
defined recursions (ADRs) provide means to reuse code. Automatically de- 
fined stores (ADSs) provide means to reuse the result of executing code. 



6.1.2 Program Architecture and Architecture- Altering 
Operations 



Koza ( 1994| ) defined the architecture of a program to be the total number of 
trees, the type of each tree (e.g., RPB, ADF, ADI, ADL, ADR, or ADS), the 
number of arguments (if any) possessed by each tree, and, finally, if there is 
more than one tree, the nature of the hierarchical references (if any) allowed 
among the trees (e.g., whether ADF1 can call ADF2). 

There are three ways to determine the architecture of the computer pro- 
grams that will be evolved: 

1. The user may specify in advance the architecture of the overall pro- 
gram, i.e., perform an architecture- defining preparatory step in addi- 
tion to the five steps itemised in Chapter [3] 

2. A run of genetic programming may employ the evolutionary design 



of the architecture (as described in (Koza, 1994)), thereby enabling 
the architecture of the overall program to emerge from a competitive 
process during the run. 



3. The run may employ a set of architecture- altering operations (Koza 



1994 1995 Koza, Bennett, Andre, and Keane 1999) which can cre- 



ate new ADFs, remove ADFs, and increase or decrease the number 
of inputs an ADF has. Note that many architecture changes (such 
as those defined in (Koza, 1994| )) are designed not to initially change 
the semantics of the program and, so, the altered program often has 
exactly the same fitness as its parent. Nevertheless, the new arrange- 
ment of ADFs may make it easier for subsequent changes to evolve 
better programs later. 
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Koza and his colleagues have used these architecture altering operations 
quite widely in their genetic design work, where they evolve GP trees that 
encode a collection of developmental operations that, when interpreted, gen- 
erate a complex structure like a circuit or an optical system (see, for example, 
page 1181. 



Section 12.3 



The idea of architecture altering operations was extended to the ex- 
tremely general Genetic Programming Problem Solver (GPPS), which is 
described in detail in (Koza et al. 1999 part 4). This is an open ended 



system which combines a small set of basic vector-based primitives with the 
architecture altering operations in a way that can, in theory, solve a wide 
range of problems with almost no input required from the user other than 
the fitness function. The problem is that this open-ended system needs a 
very carefully constructed fitness function to guide it to a viable solution, an 
enormous amount of computational effort, or both. As a result it is currently 
an idea of more conceptual than practical value. 



6.2 Constraining Structures 



As discussed in Section |3.2.1| most GP systems require type consistency 
where all subtrees return data of the same type. This ensures that the out- 
put of any subtree can be used as one of the inputs to any node. The basic 
subtree crossover operator shuffles tree components entirely randomly. Uni- 
versal type compatibility ensures that crossover cannot lead to incompatible 
connections between nodes. This is also required to stop mutation from 
producing illegal programs. 

An implicit assumption underlying this approach is that all combinations 
of structures are equally likely to be useful. In many cases, however, we know 
in advance that there are constraints on the structure of the solution, or we 
have strong suspicions about the likely form solutions will take. In this 
section, we will look at several different systems that use tools such as types 
and grammars to bias or constrain search with the primary aim of increasing 
the chance of finding a suitable program. 

A problem domain might be naturally represented with multiple types. 
This suggests that the functions used by GP and their arguments will not 
necessarily be all of the same type. This can often be addressed through 
creative definitions of functions and implicit type conversion. For example, 



the Odin system ( Holmes and Barclay 1996 ) defines operations on inappro- 



priate types to return a new fail object. These are handled by introducing 
a binary fatbar that returns its first argument unless it is fail , in which case 
it returns its second argument. 

This sort of approach may not always be desirable. For example, if a 
key goal is to evolve solutions that can be easily understood or analysed, 
then one might prefer a GP system that is constrained structurally or via 
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a type system, since these often generate results that are more comprehen- 



sible (Haynes, Wainwright, Sen, and Schoenefeld 1995), (Langdon 1998 



page 126). Similarly, if there is domain knowledge that strongly suggests a 
particular syntactic constraint on the solution, then ignoring that constraint 
may make it much harder to find a solution. 

We will focus here on three different approaches to constraining the syn- 
tax of the evolved expression trees in GP: simple structure enforcement 



(Section 6.2.11, strongly typed GP (Section 6.2.21 and grammar-based con- 
straints (Section 6.2.31. Finally, we consider the advantages and disadvan- 
tages of syntactic and type constraints and their biases (Section |6.2.4|. 



6.2.1 Enforcing Particular Structures 

If a particular structure is believed or known to be important then one 
can modify the GP system to require that all individuals have that struc- 



ture (Koza 1992). For example, if a problem is believed to have (or require) 



a periodic solution, one might want to consider constraining the search to 
solutions of the form a x sin(6 x t). By allowing a and b to evolve freely 
but keeping the rest of the structure fixed, one could restrict GP to evolving 
expressions that are periodic. Syntax restrictions can also be used to make 
GP follow sensible engineering practices. For example, we might want to 
ensure that loop control variables appear in the correct parts of for loops 



and nowhere else (Langdon 1998, page 126). 

Enforcing a user specified structure on the evolved solutions can be imple- 
mented in a number of ways. One could ensure that all the initial individuals 
have the structure of interest (for example, generating random subtrees for 
a and b while fixing the rest) and then constrain crossover and mutation 
so that they do not alter any of the fixed regions of a tree. An alternative 
approach would be to evolve the various (sub)components separately. One 
could evolve pairs of trees (a,b) (like ADFs). Alternatively, one could have 
two separate populations, one of which is used to evolve candidates for a 
while the other is evolving candidates for b. 



A form of constraint-directed search in GP was also proposed in (Tsang 



and Jin 2006 Tsang and Li 2002) to help GP to focus on more promising 



areas of the space. 



6.2.2 Strongly Typed GP 

Since constraints are often driven by or expressed using a type system, a 
natural approach is to incorporate types and their constraints into the GP 



system (Montana 1995). In strongly typed GP, every terminal has a type, 



and every function has types for each of its arguments and a type for its 
return value. The process that generates the initial, random expressions, 
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and all the genetic operators are implemented so as to ensure that they do 
not violate the type system’s constraints. 

Returning to the if example from Section |3.2.1| (page|2l|), we might have 
an application with both numeric and Boolean terminals (e.g., get_speed 
and is_f oocLahead). We might then have an if function that takes three 
arguments: a test (Boolean), the value to return if the test is true, and 
the value to return if the test is false. Assuming that the second and third 
values are numbers, then the output of the if is also going to be numeric. 
If we choose the test argument as a crossover point in the first parent, then 
the subtree (excised from the second parent) to insert must have a Boolean 
output. That is, we must find either a function which returns a Boolean or 
a Boolean terminal in the other parent tree to be the root of the subtree 
which we will insert into the new child. Conversely if we choose either the 
second or third argument as a crossover point in the first parent, then the 
inserted subtree must be numeric. In all three cases, given that both parents 
are type correct, restricting the second crossover point in this way ensures 
the child will also be type correct. 

This basic approach to types can be extended to more complex type sys- 



tems including simple generics (Montana 19951, multi-level type systems 



(Haynes, Schoenefeld, and Wainwright 1996), fully polymorphic types (Ols- 



son 



1994), and polymorphic higher-order type systems (Yu 2001). 



6.2.3 Grammar-based Constraints 

Another natural way to express constraints is via grammars, and these have 
been used in GP in a variety of ways (Gruau [1996 Hoai, McKay, and 
Abbass| |2003| [O’Neill and Ryan[ [2003| |Whigham[ 1996 Wong and Leung[ 
1996). Many of these simply use a grammar as a means of expressing the 



kinds of constraints discussed above in Section 6.2.1 For example, one could 



enforce the structure for the period function using a grammar such as the 
following: 



tree 


::= E x 


sin(E x t) 




(6.3) 


E 


::= var 


| (EopE) 






op 


::= + 


1 x | 


4- 




var 


::= x | 


y 1 ^ 







Each line in this grammar is known as a rewrite rule or a production rule. 
Elements that cannot be rewritten are known as the terminals of the gram- 
mar[^| while symbols that appear on the left-hand-side of a rule are known 
as non-terminal symbols. 



1 Not to be confused with the terminals in the primitive set of a GP system. 
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Figure 6.2: Example individual (a derivation tree) that might be evolved 
in Whigham’s grammar-based GP system ( Whighaml 1996) if the grammar 



in Equation (6.3 1 was used. Rectangles represent non-terminal symbols of 
the grammar. 



In this sort of system, the grammar is typically used to ensure the initial 
population is made up of legal “grammatical” programs. The grammar is 
also used to guide the operations of the genetic operators. Thus we need to 
keep track not only of the program itself, but also the syntax rules used to 
derive it. 

What actually is evolved in a grammar-based GP system depends on the 
particular system. Whigham (1996), for example, evolved derivation trees, 



which effectively are a hierarchical representation of which rewrite rules must 
be applied, and in which order, to obtain a particular program. Figure 6.2 
shows an example of a derivation tree representing a grammatical program 
with respect to the grammar in Equation (6.3 1 . In this system, crossover is 



restricted to only swapping subtrees deriving from a common non-terminal 
symbol in the grammar. So, for example, a subtree rooted by an E node 
could be replaced by another also rooted by an E, while an U-rooted subtree 
could not be replaced by an op-rooted one. 

The actual program represented by a derivation tree can be obtained by 
reading out the leaves of the tree one by one from left to right. For the 



derivation tree in Figure 6.2 for example, this produces the program 



y x sin((a: + z) x t). 



However, for efficiency reasons, in an actual implementation it is not con- 
venient to extract the program represented by a derivation tree is this way. 



6.2 Constraining Structures 



55 



This is because programs need to be executed in order to evaluate their 
fitness, and this flat program representation often requires further trans- 
formations before execution. It is, therefore, common to directly convert a 
derivation tree into a standard GP tree. 

Grammar-based GP approaches can be extended by incorporating con- 
cepts from computational linguistics. For example, McKay and colleagues 



used tree adjoining grammars (TAGs) (Joshi and Schabes 1997) to de- 



sign new genetic representations and operators that respect grammatical 



constraints while allowing new types of structural modifications 


(Hoai and 


McKay 2004 


Hoai et al. 2003 


Hoai, McKay, Essam, and Hao 


1005| . 



Another major grammar-based approach is grammatical evolution (GE) 

GE does not 



O’Neill and Ryan 


2003 


Ryan, Collins, and O’Neill 


1998 



use trees, instead it represents individuals as variable-length sequences of 



integers (cf. Equation 6.4 1 which are interpreted in the context of a user 
supplied grammar. 

For each rule in the grammar, the set of alternatives on the right hand 
side are numbered from 0 upwards. In the example grammar in Equa- 



tion (6.3 1 above, the first rule only has one option on the right hand side; so 
this would be numbered 0. The second rule has two options, which would 
be numbered 0 and 1. The third rule has four options which would be num- 
bered 0 to 3. Finally the fourth rule has three options numbered 0 to 2. To 
create a program from a GE individual one uses the values in the individual 
to “choose” which alternative to take in the production rules. For example, 
suppose a GE individual is represented by the sequence 



39,7,2,83,66,92,57,80,47,94 



(6.4) 



then we start with 39 and the first syntax rule, tree. However tree has no 
alternatives, so we move to 7 and rule E. Now E has two alternatives and 
7 is used (via modulus) to chose between them. More of the translation 
process is given in Figure |6.3| 

In this example we did not need to use all the numbers in the sequence 
to generate a complete program. Indeed the last integer, 94, was not used. 
In general, “extra” genetic material is simply ignored. More problematic is 
when a sequence is “too short” in the sense that the end of the sequence 
is reached before the translation process is complete. There are a variety 
of options in this case, including failure (assigning this individual the worst 
possible fitness) and wrapping (continuing the translation process, moving 
back to the front of the numeric sequence). Grammatical evolution has been 
very successful and is widely used. 



6.2.4 Constraints and Bias 

One of the common arguments in favour of constraint systems like types 
and grammars is that they limit the search space by restricting the kind 
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tree 

( 39 mod 1 = 0,i.e., there is only one option ) 
E x sin(£' x t) 

( 7 mod 2 = 1, i.e., choose second option ) 

( E op E ) x sin(£’ x t) 

( 2 mod 2 = 0, i.e., take the first option ) 

(var op E) x sin(£' x t ) 

( 83 mod 3 = 2, pick the third variable, ) 

(* op E) x sin(i5 x t ) 

( 66 mod 4 = 2, take the third operator ) 

(z x E) x sin (E x t) 

(z x x) x sin( 2 : x t ) 



Figure 6.3: Sample grammatical evolution derivation using the grammar in 
Equation ( 6.3 1 and the integer sequence in Equation ( 6.4 1 . The non-terminal 
to be rewritten is underlined in each case. 



of structures that can be constructed. While this is true, it can come at a 
price. 

An expressive type system typically requires more complex machinery to 
support it. It often makes it more difficult to generate type-correct individ- 
uals in the initial population or during mutation and it is more difficult to 
find crossover points that do not violate the type system. In an extreme case 



like, constructive type theory (Thompson 19911, the type system is so pow- 



erful that it can completely express the formal specification of the program. 
Thus, any program/expression having this type is guaranteed to meet that 
specification. In GP this would mean that all the members of the initial 
population would need to be solutions to the problem, thus putting all the 
problem solving burden on the initialisation phase and removing the need 
for any evolution at all! Even without such extreme constraints, it has often 
been found necessary to develop additional machinery in order to efficiently 



generate an initial population that satisfies the constraints (Montana 1995 
Ratle and Sebag 2000 Schoenauer and Sebag 2001 |Yul|2001| ). 

As a rule, systems that focus on syntactic constraints (such as grammar- 
based systems) require less machinery than those that focus on semantic 
constraints (such as type systems), since it is typically easier to satisfy the 
syntactic constraints in a mechanistic fashion. For example, grammar-based 
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systems, such as grammatical evolution and the various TAG-based systems, 
are typically simple to initialise, and mutation and crossover need to enforce 
few, if any, constraints on the new child. The work (and the bias) in these 
systems is much more in the choice of the grammar, and once it has been 
designed, there is often little additional work required of the practitioner or 
the GP system to enforce the implied constraints. 

While a constraint system may limit the search space in valuable ways 
( Ratle and Sebag 2000 1 and can improve performance on interesting prob- 
lems (Hoai, McKay, and Essam 2006), there is no general guarantee that 
constraint systems will make the evolutionary search process easier. There 
is no broad assurance that a constraint will increase the density of solu- 
tions or (perhaps more importantly) approximate solutions^] Also, while 



there are cases where constraint systems smooth the search landscape (Hoai 
et al. 2006), it is also possible for constraint systems to make the search 



landscape more rugged by preventing genetic operations from creating inter- 
mediate forms on potentially valuable evolutionary paths. In the future, it 
might be useful to extend solution density studies such as those summarised 
in ( Langdon and Poli[ 2002 1 to the landscapes generated by constraint sys- 
tems in order to better understand the impact of these constraints on the 
underlying search spaces. 

In summary, while types, grammars, and other constraint systems can 
be powerful tools, all such systems carry biases. One therefore needs to be 
careful to explore the biases introduced by the constraints and not simply 
assume that they are beneficial. 
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By using appropriate terminals, functions and/or interpreters, GP can go 



beyond the production of computer programs. In cellular encoding (Gruau 
1994| Gruau and Whitley} |1993 Gruau[ 1994), programs are interpreted 



as sequences of instructions which modify (grow) a simple initial structure 
(embryo). Figure [6A] shows part of the development of an electronic circuit)^] 
Once the program has finished, the quality of the structure it has produced 
is measured and this is taken to be the fitness of the program. 

Naturally, for cellular encoding to work the primitives of the language 
must be able to grow structures appropriate to the problem domain. Typical 
instructions involve the insertion and/or sizing of components, topological 



2 By “solution density” we refer to the ratio between the number of acceptable solutions 
in a program search space and the size of the search space itself. This is a rough assessment 
of how hard a problem is, since it gives an indication of how long random search would 
take to explore the program space before finding an acceptable solution. 

'’’The process is easier to explain with a movie. This can be downloaded from http: 
//www . genetic-programming . com/gpdevelopment . html 
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Figure 6.4: Screen shot of an animated gif showing the development 
of the topology and the sizing of an electrical circuit (from http://www. 
genetic-programming, com/gpdevelopment .html). The program is inter- 
preted in parallel. Solid arrows link the active code to the parts of the 
electronic circuit (lower half) that are being modified. The three-headed 
arrow from S shows that three new components (Z4) have just been created 
in series. Their types (e.g., capacitor, inductor or resistor) and values will 
be determined by the three arguments of the S “function”. 



modifications of the structure, etc. Cellular encoding GP has successfully 



been used to evolve neural networks (Gruau 1994 Gruau and Whitley 



1993 



Gruau 


1994 


i and electronic circuits ( 


Koza et al. 


1999 


Koza, Andre, 



Bennett, and Keane 1996a Koza, Bennett, Andre, and Keane 1996c I, as 



well as in numerous other domains. A related approach proposed by |Hoang)] 



Essam, McKay, and Nguyen 


(2007) combines tree adjoining grammars (Sec- 


tion 


6.2.3 


) with L-systems 


Lindenmayer 


1968) to create a system where 



each stage in the developmental process is a working program that respects 
the grammatical constraints. 



One of the advantages of indirect representations such as cellular en- 
coding is that the standard GP operators can be used to evolve structures 
(such as circuits) which may have nothing in common with standard GP 
trees. In many of these systems, the structures being “grown” are also still 
meaningful (and evaluable) at each point in their development. This allows 
fitness evaluation. Another important advantage is that structures result- 
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ing from developmental processes often have some regularity, which other 
methods obtain through the use of ADFs, constraints, types, etc. A dis- 
advantage is that, with cellular encoding, individuals require an additional 
genotype-to-phenotype decoding step. However, when the fitness function 
involves complex calculations with many fitness cases, the relative cost of the 
decoding step is often small compared with the rest of the fitness function. 



6.4 Strongly Typed 
with PushGP 



Autoconstructive GP 



While types are often used to constrain evolution, Spector’s PushGP (Klein 



and Spector 


2007 


Robinson and Spector[ 2002| Spector[ 2001| Spector, 


Klein, and Keijzer 


2005a 


) is a move away from constraining evolution. 


Essentially PushGP uses genetic programming to automatically create 



programs written in the Push programming language. Push is a strongly 
typed tree based language which does not enforce syntactic constraints. 
Each of Push’s types has its own stack. In addition to stacks for inte- 
gers, floats, Booleans and so on, there is a stack for objects of type program. 
Using this code stack, Push naturally supports both recursion and program 



modules (see Section 6.1.1 1 without human pre-specification. The code stack 
allows an evolved program to push itself or fragments of itself onto the stack 
for subsequent manipulation. 

PushGP can use the code stack and other operations to allow programs to 
construct their own crossover and other genetic operations and create their 
own offspring. Programs are prevented from simply duplicating themselves 
to deflect catastrophic loss of population diversity. 



Chapter 7 

Linear and Graph 
Genetic Programming 



Until now we have been talking about the evolution of programs expressed 
as one or more trees which are evaluated by a suitable interpreter. This is 
the original and most widespread type of GP, but there are other types of 
GP where programs are represented in different ways. This chapter will look 
at linear programs and graph- like (parallel) programs. 



7.1 Linear Genetic Programming 



In linear GP programs are linear sequences of instructions, such as the one 
in Figure |7d~| The number of instructions can be fixed, meaning that every 
program in the population has the same length, or variable, meaning that 
different individuals can be of different sizes. In the following sections we 
discuss reasons for using linear GP (Section 7.1.1| ). We then provide more 
details on the different flavours of linear GP (Section 7.1.2). Finally, we 



describe briefly the main genetic operations for linear GP (Section 7.1.3). 



7.1.1 Motivations 

There are two different reasons for trying linear GP. Firstly, almost all 
computer architectures represent computer programs in a linear fashion with 



Instruction 1 


Instruction 2 


— 


Instruction N 



Figure 7.1: Typical linear GP representation for programs. 
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Output 


Arg 1 


Opcode 


R0..R7 


R0..R7 


+ - * / 



Arg 2 

0...127 

or 

R0..R7 



Figure 7.2: Format of a linear GP engine instruction. RO to R7 refer to 
CPU’s registers. 



neighbouring instructions being normally executed in consecutive time steps 
(albeit control structures, jumps and loops may change the execution order). 
So, why not evolve linear programs? This led Banzhaf (19931, Perkis ( 1994| ) 
as well as Openshaw and Turton ( 1994 1 to try linear GP. 

Secondly, computers do not naturally run tree-shaped programs, so in- 
terpreters or compilers have to be used as part of tree-based GP. On the 
contrary, by evolving the binary bit patterns actually obeyed by the com- 
puter, linear GP can avoid the use of this computationally expensive ma- 
chinery and GP can run several orders of magnitude faster. This desire for 



speed drove Nordin ( 1994 1 , Nordin et al. ( 1999 1 , Crepeau ( 1995 ) and Eklund 

pool. 



7.1.2 Linear GP Representations 

As discussed in Section |2.1| it is possible to use a linear representation in 
tree-based GP. When doing so, however, the linear structures are simply 
flattened representations of the trees. Thus, in the linear structure one can 
still identify the root node, its children, and the rest of the tree structure. 
In such a system, instructions typically communicate via their arguments. 

The semantics of linear GP are quite different, however. In linear GP, in- 
structions typically read their input (s) from one or more registers or memory 
locations and store the results of their calculations in a register. For exam- 
ple, they might take the form shown in Figure [772} This means instructions 
in linear GP all have equivalent roles and communicate only via registers 
or memory. In linear GP there is no equivalent of the distinction between 
functions and terminals inherent in tree-based GP. Also, in the absence of 
loops or branches, the position of the instructions determines the order of 
their execution. Typically, this not the case for the structures representing 

trees Q 

The instructions in linear GP may or may not represent executable ma- 
chine code. That is, there are essentially two flavour of linear GP: machine 
code GP , where each instruction is directly executable by the CPU, and 

1 Typically, in tree-based-GP the nodes are visited (but not executed ) from left to right 
in depth-first order. Primitives are only executed, however, when their arguments have 
been evaluated. So, the root node is the first node visited but the last executed. 
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interpreted linear GP , where each instruction is executable by some higher- 
level virtual machine (typically written in an efficient language such as C 
or C++). When the instructions are actual machine code, then the order 
of the elements of the representation shown in Figure [7+] is determined by 
the particular computer architecture used, and the corresponding data must 
be packed into bit fields of appropriate sizes. The overhead of packing and 
unpacking data can be avoided, however, when one is using virtual machine 
instructions since then the designer of a GP system has complete freedom 
as to how the virtual machine will interpret its instructions. 

If the goal is execution speed, then the evolved code should be machine 
code for a real computer rather than some higher level language or virtual- 



machine code. This is why 


Nordin 


(1994 


) started by evolving machine code 


for SUN computers and 


Crepeau 


1995 


targeted the Z80. The linear GP 


of 


Leung, Lee, and Cheang 


2002 


) was designed for novel hardware, but much 



of the GP development had to be run in simulation whilst the hardware itself 
was under development. 

The Sun SPARC has a simple 32-bit RISC architecture which eases 
designing genetic operations which manipulate its machine code. |Nordin| 
(1997) wrapped each machine code GP individual (which was a sequence of 
machine instructions) inside a C function. Each of the GP program’s inputs 
was copied from one of the C function’s arguments into one of the machine 
registers. As well as the registers used for inputs]^] a small number (e.g., 
2-4) of other registers are used for scratch memory to store partial results of 
intermediate calculations. Finally, the GP simply leaves its answer in one of 
the registers. The external framework uses this as the C function’s return 
value. 

Since Unix was ported onto the x86, Intel’s complex instruction set, 
which was already standard with Windows-based PCs, has had almost com- 
plete dominance. Seeing this, Nordin ported his Sun RISC linear GP system 
onto Intel’s CISC. Various changes were made to the genetic operations 
which ensure that the initial random programs are made only of legal In- 
tel machine code and that mutation operations, which act inside the x86’s 
32-bit word, respect the x86’s complex sub-fields. Since the x86 has instruc- 
tions of different lengths, special care has to be taken when altering them. 
Typically, several short instructions are packed into each 4-byte word. If 
there are any bytes left over, they are filled with no-operation codes. In 
this way, best use is made of the available space, without instructions cross- 
ing 32-bit boundaries. Nordin’s work led to Discipulus (Foster, 2001), 



which has been used in applications ranging from bioinformatics (Vukusic, 



Grellscheid, and Wiehe 2007) to robotics (Langdon and Nordin 20011 and 



2 Anyone using a register-based GP (linear or tree-based) should consider write- 
protecting the input registers to prevent the inputs from being overwritten. Otherwise 
evolved programs (especially in the early generations) are prone to writing over their 
inputs before they’ve had a chance to use them in any constructive way. 
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bomb disposal (Deschaine, Hoover, Skibinski, Patel, Francone, Nordin, and 



Ades, 2002). 



Note that execution speed is not the only reason for using linear GP. Al- 
though interpreted linear programs are slower than machine-code programs, 
an interpreted linear GP system can be more efficient than an interpreted 
tree-based systems. Also, a simple linear structure lends itself to rapid anal- 
ysis. Brameier and Banzhaf (2001) showed a linear program can be easily 



scanned and any “dead code” it contains can be removed. In some ways the 



don 1999b 


2002a|b 


2003a 


Langdon and Banzhaf 2005). For example, 


Langdon 1 


2006 


); Langdon and Poli 


( 2006 ? 2008 ) have used (in simulation) 



two simple architectures, called T 7 and T8), for several large scale experi- 
ments and for the mathematical analysis of Turing complete GP. For these 
reasons, it makes sense to consider linear virtual-machine code GP even 
when using languages like Java that are typically run on virtual machines; 



one can in fact use a virtual machine (like (jLeung et al 


2002)) to inter- 


pret the evolved byte code 1 


Harvey, Foster, and Frincke 


1999 


Lukschandl, 


Borgvall, Nohle, Nordahl, and Nordin 


2000 


)• 



7.1.3 Linear GP Operators 



The typical crossover and mutation operators for linear GP ignore the details 
of the machine code of the computer being used. For example, crossover may 
choose randomly two crossover points in each parent and swaps the code 
between them. Since the crossed over fragments are typically of different 
lengths, such a crossover may change the programs’ lengths, cf. Figure [773] 
Since computer machine code is organised into 32- or 64-bit words, the 
crossover points occur only at the boundaries between words. Therefore, 
a whole number of words, containing a whole number of instructions are 
typically swapped over. Similarly, mutation operations normally respect 
word boundaries and generate legal machine code. However, linear GP lends 
itself to a variety of other genetic operations. For example, F igure |7~4| shows 
homologous crossover. Many other crossover and mutation operations are 



possible (Langdon and Banzhaf 2005). 



In a compiling genetic programming system (Banzhaf, Francone, and 
Nordin 1996 1 the mutation operator acts on machine code instructions 



and is constrained to “ensure that only instructions in the function set are 
generated and that the register and constant values are within predefined 
ranges allowed in the experimental set up”. On some classification prob- 



lems Banzhaf et al. (1996) reported that performance was best when using 



crossover and mutation in equal proportions. They suggested that this was 
due to the GP population creating “introns” (blocks of code that does not 
affect fitness) in response to the crossover operator, and that these were sub- 
sequently converted into useful genetic material by their mutation operator. 
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Parent 1 
Parent 2 [ 



Offspring 



Figure 7.3: Typical linear GP crossover. Two instructions are randomly 
chosen in each parent (top two genomes) as cut points. The code fragment 
excised from the first parent is then replaced with the code fragment excised 
from the second to generate the child (lower chromosome). 

Parent 1 i 



Parent 2 



Offspring 1 ) 

Offspring 2 [ 



Figure 7.4: Discipulus’s “homologous” crossover (Foster 2001 |Francone 



et al.[ 1999[ Nordin et al. 1999 ). Crossover is performed on two parents (top 



two programs) to yield two offspring (bottom). The two crossover points are 
the same in both parents, so the exised code does not change its position 
relative to the start of the program (left edge), and the child programs have 
the same lengths as their parents. Homologous crossover is often combined 
with a small amount of normal two point crossover (Figure [7.3[ ) to introduce 
length changes into the GP population. 



7.2 Graph-Based Genetic Programming 

Trees are special types of graphs. So it is natural to ask what would happen 
if one extended GP so as to be able to evolve graph-like programs. Starting 
from the mid 1990s, researchers have proposed several extensions of GP that 
do just that, albeit in different ways. 



7.2.1 Parallel Distributed GP 



Poli (1996a 1999a) proposed parallel distributed GP (PDGP), a form of GP 



that is suitable for the evolution of highly parallel programs which effec- 
tively reuse partial results. Programs are represented in PDGP as graphs 



66 



7 Linear and Graph Genetic Programming 





(a) (b) 



Figure 7.5: A sample tree where the same subtree is used twice (a) and 
the corresponding graph-based representation of the same program (b) . The 
graph representation may be more efficient since it makes it possible to avoid 
the repeated evaluation of the same subtree. 



with nodes representing functions and terminals. Edges represent both con- 
trol flow and data flow. The possible efficiency gains obtained by a graph 
representation are illustrated in Figure |7.5| 

In the simplest form of PDGP edges are directed and unlabelled, in 
which case PDGP is a generalisation of standard GP. However, more com- 
plex representations can be used, which allow the evolution of: programs, 
including standard tree-like programs, logic networks, neural networks, re- 
current transition networks and finite state automata. This can be achieved 
by extending the representation by associating labels with the edges of the 
program graph. In addition to the function and terminal sets, this form of 
PDGP requires the definition of a link set. The labels on the links depend 
on what is to be evolved. For example, in neural networks, the link labels 
are numerical constants for the neural network weights. In a finite state au- 
tomaton, the edges are labelled with the input symbols that determine the 
FSA’s state transitions. It is even possible for the labels to be automatically 
defined edges , which play a role similar to ADFs (Section 6.1.1) by invoking 
other PDGP graphs. 

In PDGP, programs are manipulated by special crossover and mutation 
operators which guarantee the syntactic correctness of the offspring. Each 
node occupies a position in a regular grid. The genetic operators act by 
moving, copying or randomly generating sub-regions of the grid. For this 
reason PDGP search operators are very efficient. 

PDGP programs can be executed according to different policies depend- 
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ing on whether instructions with side effects are used or not. If there are 
no side effects, running a PDGP program can be seen as a propagation of 
the input values from the bottom to the top of the program’s graph (as in 
a feed- forward artificial neural network or data flow parallel computer) . 

7.2.2 Parallel Algorithm Discovery and Orchestration 

called parallel algorithm discovery and orchestration (PADO), 

I used a combination of GP and linear discrimination to obtain 
parallel classification programs for signals and images. 

PADO programs include three parts: a main loop, some ADFs and an 
indexed memory. The main loop is repeatedly executed for a fixed amount 
of time. When the time is up, PADO programs are forced to halt by some 
external control structure. The output of a program is the weighted average 
of the outputs produced at each iteration of the loop. The weights are 
proportional to the iteration count, so that more recent outputs count more. 

The main loop and the ADFs in PADO are structured as arbitrary di- 
rected graphs of nodes. Each node can have multiple outgoing arcs that 
indicate possible flows of control. Each node has two main parts: an ac- 
tion and a branch-decision. Each program has an argument stack and all 
PADO actions pop their inputs from this argument stack and push their re- 
sult back onto the argument stack. The actions are drawn from a primitive 
set including the standard algebraic operations, minimum, maximum, nega- 
tion, read from indexed memory, write to indexed memory, deterministic 
and non-deterministic branching instructions, and primitives related to the 
task of classifying images. The evaluation of PADO programs starts from a 
designated node. After the execution of each node, control is passed to the 
node chosen by the branch-decision function of the current node. 



In a system 
TMct] (11996 



7.2.3 Cartesian GP 



In Miller’s Cartesian GP (Miller 1999 Miller and Smith) 20061, programs 
are represented by linear chromosomes containing integers. These are di- 
vided into groups of three or four. Each group corresponds to a position in 
a 2-D array. One integer in each group defines the primitive (e.g., an AND 
gate) at that location in the array. Other integers in the group define the 
locations (coordinates) in the genome from which the inputs for that primi- 
tive should be drawn. Each primitive does not itself define where its output 
is used; this is done by later primitives. A primitive’s output may be used 
more than once, or indeed not used at all, depending on the way in which 
the other functions’ inputs are specified. Thus, Cartesian GP’s chromo- 
somes represent graph-like programs, which is very similar to PDGP. The 
main difference between the two systems is that Cartesian GP operators act 
at the level of the linear chromosome, while in PDGP they act directly on 
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the graph. Also, traditionally Cartesian GP has always used mutation as 
its main search operation, while PDGP used both crossover and mutation. 
However, recently a new crossover has been proposed for Cartesian GP that 



provides faster convergence (Clegg, Walker, and Miller 2007). 



7.2.4 Evolving Parallel Programs using Indirect 
Encodings 

The graph-based systems discussed above use representations which directly 
encode parallel programs. However, it is also possible to use non-graph- 
based GP to evolve parallel programs. For example, Bennett (1996) used a 
parallel virtual machine in which several standard tree-like programs (called 
“agents”) would have their nodes executed in parallel. He included a two- 
stage mechanism which simulated parallelism of sensing actions and simple 



conflict resolution (prioritisation) for actions with side effects. Andre, Ben- 
nett, and Koza ( 1996[ ) used GP to discover rules for cellular automata, a 
highly parallel computational architecture, which could solve large majority- 
classification problems. In conjunction with an interpreter implementing a 
parallel virtual machine, GP can also be used to translate sequential pro- 



grams into parallel ones (Walsh and Ryan 
programs. 



1996), or to develop parallel 



Chapter 8 

Probabilistic Genetic 
Programming 



Genetic programming typically uses an evolutionary algorithm as its main 
search engine. However, this is not the only option. The use of simulated 
annealing and hill climbing to search the space of computer programs was 
mentioned in Section |5.4| This chapter considers recent work where the ex- 
ploration is performed by population-based search algorithms which adapt 
and sample probability distributions instead of using traditional genetic op- 
erators. 

Sampling from a probability distribution means generating random val- 
ues whose statistical properties match those of the given distribution. For 
example, if one sampled a univariate Gaussian distribution, one would ex- 
pect the resulting values to tend to have mean and standard deviation sim- 
ilar to the mean and standard deviation of the Gaussian. The notion of 
sampling can be extended to much more complex distributions involving 
multiple variables. Furthermore, discrete as well as continuous variables are 
possible. 



8.1 Estimation of Distribution Algorithms 

Estimation of distribution algorithms (EDAs) are powerful population-based 
searchers where the variation operations traditionally implemented via 
crossover and mutation in EAs are replaced by the process of random sam- 
pling from a probability distribution. The distribution is modified genera- 
tion after generation, using information obtained from the fitter individuals 
in the population. The objective of these changes in the distribution is to 
increase the probability of generating individuals with high fitness. 
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Different EDAs use different models for the probability distribution that 
controls the sampling (see ( |Larranaga 2002 Larrahaga and Lozano 2002 1 
for more information). For example, population-based incremental learning 
(PBIL) ( Baluja and Caruana 1995 ) and the uniform multivariate distribu- 
tion algorithm (UMDA) (Miihlenbein and Mahnig 1999a|b I assume that 



each variable is independent of the other variables. Consequently, these al- 
gorithms need to store and adjust only a linear array of probabilities, one for 
each variable. This works well for problems with weak interactions between 
variables. Since no relationship between the variables is stored or learned, 
however, PBIL and UMDA may have difficulties solving problems where the 
interactions between variables are significant. 



Naturally, higher order models are possible. For example, the MIMIC 
algorithm of de Bonet, Isbell, and Viola (1997) uses second-order statis- 
tics. It is also possible to use flexible models where interactions of dif- 
ferent orders are captured. The Bayesian optimisation algorithm (BOA) 



(Pelikan, Goldberg, and Cantu-Paz 1999) uses baysian networks to rep- 



resent generic sampling distributions, while the extended compact genetic 
algorithm (eCGA) (Harik 1999) clusters genes into groups where the genes 
in each group are assumed to be linked but groups are considered inde- 
pendent. The sampling distribution is then taken to be the product of the 
distributions modelling the groups. 



EDAs have been very successful. However, they are often unable to rep- 
resent both the overall structure of the distribution and its local details, 
typically being more successful at the former. This is because EDAs rep- 
resent the sampling distribution using models with an, inevitably, limited 
number of degrees of freedom. For example, suppose the optimal sampling 
distribution has multiple peaks, corresponding to different local optima, sep- 
arated by large unfit areas. Then, an EDA can either decide to represent 
only one peak, or to represent all of them together with the unfit areas. If 
the EDA chooses the wrong local peak this may lead to it getting stuck and 
not finding the global optimum. Conversely if it takes a wider view, this 
leads to wasting many trials sampling irrelevant poor solutions. 



Consider, for example, a scenario where there are five binary variables, 
aq, X 2 , £3, X 4 and X 5 , and two promising regions: one near the string of all 
zeros, i.e., (aq, X 2 , £3, £4, £5) = (0,0,0, 0,0), and the other near the string 
of all ones, i.e., (aq, £2, £3,2:4, £5) = (1,1, 1,1,1). One option for a (simple) 
EDA is to focus on one of the two regions, e.g., setting the variables 2 q 
to 0 with high probability (say, 90%). This, however, fails to explore the 
other region, and risks missing the global optimum. The other option is to 
maintain both regions as possibilities by setting all the probabilities to 50%, 
i.e., each of the variables £* is as likely to be 0 as 1. These probabilities will 
generate samples in both of the promising regions. For example, the strings 
(0,0, 0,0,0) and (1,1, 1,1,1) will each be generated with a 3.125% proba- 
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bility. Also, simple calculations show that 31.25% of individuals generated 
by this distribution will be at Hamming distance 1 from either (0,0, 0,0,0) 
or (1,1,1,1,1)Q So, both optimal regions are sampled reasonably often. 
However, it is clear that the majority (62.5%) of samples will be allocated 
to less promising regions, where the Hamming distance will be 2 or 3 from 
both (0,0,0, 0,0) and (1,1, 1,1,1). This is a significant concern, which is 
why recently EDAs have often been used in combination with local search 
(e.g., see ( Zhang, Sun, and Tsang[ 20051). 

There have been several applications of probabilistic model-based evolu- 
tion (EDA-style) in the areas of tree-based and linear GP. We review them 
in the rest of this chapter. 



8.2 Pure EDA GP 



The first EDA-type GP system was effectively an extension of PBIL to trees 



called probabilistic incremental program evolution (PIPE) (Salustowicz and 
Schmidhube r , 1997} Salustowicz, Wiering, and Schmidhuber} 1998 Salus- 



towicz and Schmidhuber 1999). In PIPE, the population is replaced by a 



hierarchy of probability tables organised into a tree (such as the one in Fig- 
ure 8.1 1 . Each table represents the probability that a particular primitive 



will be chosen at that specific location in a newly generated program tree. 
At each generation a population of programs is created based on the current 
tree of probability tables. The generation of a program begins by choosing a 
root node based on the probabilities in the root table, and then continuing 
down the hierarchy of probability tables until all branches of the tree are 
complete (i.e., a terminal has been chosen on each branch). The fitness of 
these new programs is computed, and the probability hierarchy is updated 
on the basis of these fitnesses, so as to make the generation of above-average 
fitness programs more likely in the next generation. 

A positive feature of PIPE is that the probability of choosing a particular 
primitive can vary with its depth (and, more generally, position) in the tree. 
This makes it possible, for example, for terminals to become increasingly 
probable as a node’s depth increases. A limitation of PIPE, however, is that 
the primitives forming a tree are chosen independently from each other 0so it 
is impossible for PIPE to capture dependencies among primitives. Another 
limitation is that the maximum size of the generated trees is constrained 
by the size of the tree of probability tables. |Ondas, Pelikan, and Sastry| 
(2005) compared the performance of PIPE and standard tree-based GP on 



The Hamming distance between two strings (whether binary or not) is the number 
of positions where the two strings differ. 

2 There is a weak form of dependency, in that there can be a primitive in a particular 
position only if the primitive just above it is a function. The choice of this parent primitive 
does not, however, influence the choice of the child primitive. 
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a small set of artificial problems, including a GP version of one-mas^] and a 
GP version of the fully deceptive trap functioi^j Results suggest that PIPE 
and standard GP have similar scaling properties, but that standard subtree 
crossover inherently links neighbouring nodes whereas PIPE does not. 



Sastry and Goldberg ( 2003 ) proposed an algorithm called extended com- 



pact GP (eCGP) which effectively extends the eCGA algorithm (Harik 



1999) to trees. Like PIPE, eCGA assumes that all trees will fit within a 



fixed maximal tree. It partitions the nodes in this maximal tree into groups. 
The nodes in a group are assumed to be linked and their co-occurrence is 
modelled by a full joint distribution table. As with eCGA, the probability 
of generating a particular tree is given by the product of the probabilities of 
generating each group of nodes using the groups’ joint distributions. An ad- 
vantage of this system is that, unlike PIPE, it captures dependencies among 
primitives. However, to the best of our knowledge this system has only been 
tested on the two artificial problems used by Ondas et al. (2005) to compare 
PIPE and GP. Consequently its behaviour on more typical GP problems is 
unknown. 



Yanai and Iba ( 2003 ) proposed an EDA called estimation of distribution 



programming (EDP) which, in principle, can capture complex dependencies 
between a node in a program tree and the nodes directly above it or to its 
left j^] As with eCGP and PIPE, programs are tree-like and are assumed 
to always fit within an ideal maximal full tree. A conditional probability 
table is necessary for each node in such a tree to capture the dependencies. 
To keep the size of data structures manageable, only pairwise dependencies 
between each node and its parent were stored and used. EDP was tested on 
both the Max problem (Gathercole and Ross 19951 and the 6-multiplexer. 



A later hybrid algorithm combining EDP and GP was proposed ( Yanai and 



Iba 2004) which showed promise when tested on three symbolic regression 



problems. 

An EDA based on a hierarchical BOA was used as the main mechanism 
to generate new individuals in meta- optimising semantic evolutionary search 
(MOSES) (Looks 2007). This combined multiple strategies and used seman- 
tics to restrict and direct the search. BOA was also used to evolve programs 
in (Looks, Goertzel, and Pennachin 2005) using a specialised representation 
for trees. 



3 One-max is a simple GA test problem where the goal is to maximise the number of 
l’s in a binary string. 

4 Trap functions are fitness functions that have a gradual slope leading to a sub-optimal 
local maxima, and a steep valley between that local maxima and the global optima. They 
therefore tend to “trap” populations on the local maxima 

5 In the general case a node can depend on the choices of any of the nodes that have 
already been chosen. Since the tree is constructed in a depth-first, left-to-right fashion, 
it can depend on any nodes that are its direct ancestors, or any nodes that are to its left 
in the tree. In practice, however, EDP only tracked the conditional probability of a node 
on its parent. 




Figure 8.1: Example of probability tree used for the generation of programs 
in PIPE. New program trees are created starting from the root node at the 
top and moving through the hierarchy. Each node in an offspring tree is 
selected from the left hand side of the corresponding table with probability 
given by the right hand side. Each branch of the tree continues to expand 
until either the tree of probability tables is exhausted or a leaf (e.g., R) is 
selected. 
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Recently an EDA-based system called N-gram GP (Poli and McPhee 



2008a) has been proposed that allows the evolution of linear GP programs. 



To some extent, N-gram GP overcomes the common difficulties EDAs have 
in performing local search when using a centralised population model. The 
N-gram GP system is able to capture both the local and the global features 
of the optimal sampling distribution, albeit at the cost of imposing certain 
other constraints. This makes it possible, for example, for the search to focus 
on the neighbourhood of a small number of individuals without the need to 
choose among them. Tests on polynomial symbolic regression problems and 
the lawnmower problem were very encouraging. 



8.3 Mixing Grammars and Probabilities 



A variety of other systems have been proposed which combine the use of 
grammars and probabilities. We mention only a few here; a more extended 



review of these is available in (Shan, McKay, Essam, and Abbass 2006). 



Ratle and Sebag (20011 used a stochastic context-free grammar to gen- 
erate program trees. The probability of applying each rewrite rule was 
adapted using a standard EDA approach so as to increase the likelihood of 
using successful rules. The system could also be run in a mode where rule 
probabilities depended upon the depth of the non-terminal symbol to which 
a rewrite rule was applied, thereby providing a higher degree of flexibility. 

The approach taken in program evolution with explicit learning (PEEL) 



(Shan, McKay, Abbass, and Essam, 2003) was slightly more general. PEEL 



used a probabilistic L-system where rewrite rules were both depth- and 
location-dependent. The probabilities with which rules were applied were 
adapted by an ant colony optimisation (ACO) algorithm (Dorigo and 



Stiitzle 20041. Another feature of PEEL was that the L-system’s rules 
could be automatically refined via splitting and specialisation. 

Other programming systems based on probabilistic grammars which are 



optimised via ant systems include ant-TAG ( 


Abbass, Hoai, and McKay 


2002| |Shan, Abbass, McKay, and Essam 


2002 


1, which uses a tree-adjunct 



grammar as its main representation, and generalised ant programming 



(GAP) (Keber and Schuster 2002), which is based on a context-free gram- 



mar. Other systems which learn and use probabilistic grammars include 



grammar model based program evolution (GMPE) (Shan, McKay, Baxter, 



Pozo 



2005). 



Abbass, Essam, and Hoai 20041, the system described in ( |Bosman and de 
Jong] 2004 a|b|) and Baysian automatic programming (BAP)~|Regolin and 



Chapter 9 



Multi-objective 
Genetic Programming 



The area of multi-objective GP (MO GP) has been very active in the last 
decade. In a multi-objective optimisation (MOO) problem, one optimises 
with respect to multiple goals or fitness functions fufi,--- The task of a 
MOO algorithm is to find solutions that are optimal, or at least acceptable, 
according to all the criteria simultaneously. 

In most cases changing an algorithm from single-objective to multi- 
objective requires some alteration in the way selection is performed. This is 
how many MO GP systems deal with multiple objectives. However, there 
are other options. We review the main techniques in the following sections. 

The complexity of evolved solutions is one of the most difficult things 
to control in evolutionary systems such as GP, where the size and shape of 
the evolved solutions is under the control of evolution. In some cases, for 
example, the size of the evolved solutions may grow rapidly, as if evolution 
was actively promoting it, without any clear benefit in terms of fitness. We 
will provide a detailed discussion of this phenomenon, which is know as bloat , 
and a variety of counter measures for it in Section |ll.3| However, in this 
chapter we will review work where the size of evolved solutions has been 
used as an additional objective in multi-objective GP systems. Of course, 
we will also describe work where other objectives were used. 

9.1 Combining Multiple Objectives into a 
Scalar Fitness Function 

When given multiple fitness functions, it is natural to think of combining 
them in some way so as to produce an aggregate scalar fitness function. For 
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example, one could use a linear combination of the form / = j>A Wifi , where 
the parameters W\, w 2 , . . . are suitable constants. A MOO problem can then 
be solved by using any single-objective optimisation technique with / as a 
fitness function. This method has been used frequently in GP to control 
bloat. By combining program fitness and program size to form a parsimo- 
nious fitness function one can evolve solutions that satisfy both objectives 



(see Koza (f 


.992); Zhang and Miihlenbein 


(1993 1995) 


Zhang, Ohm, and 


Miihlenbein 


(1997) and Section 


11.3.21. 



A semi-linear aggregation of fitness and speed was used in (Langdon 
and Poli |1998b I to improve the performance of GP on the Santa Fe Trail 



Ant problem. There, a threshold was used to limit the impact of speed to 
avoid providing an excessive bias towards ants that were fast but could not 
complete the trail. 

A fitness measure which linearly combines two related objectives, the 
sum of squared errors and the number of hits (a hit is a fitness case in which 



the error falls below a pre-defined threshold), was used in (Langdon, Barrett, 



and Buxton 2003 ) to predict biochemical interactions in drug discovery. 



Zhang and Bhowan (20041 used a MO GP approach for object detection. 
Their fitness function was a linear combination of the detection rate (the 
percentage of small objects correctly reported), the false alarm rate (the 
percentage of non-objects incorrectly reported as objects), and the false 
alarm area (the number of false alarm pixels which were not object centres 
but were incorrectly reported as object centres). 



O’Reilly and Hemberg (20071 used six objectives for the evolution of 



L-systems which developed into 3-D surfaces in response to a simulated 
environment. The objectives included the size of the surface, its smoothness, 
its symmetry, its undulation, the degree of subdivision of the surface, and 
the softness of its boundaries. 



(Koza, Jones, Keane, and Streeter 



2004) used 16 different objectives 
in the process of designing analogue electrical circuits. In the case of an 
amplifier circuit these included: the lOdB initial gain, the supply current, the 
offset voltage, the gain ratio, the output swing, the variable load resistance 
signal output, etc. These objectives were combined in a complex heuristic 
way into a scalar fitness measure. In particular, objectives were divided 
into groups and many objectives were treated as penalties that were applied 
to the main fitness components only if they are outside certain acceptable 
tolerances. 



9.2 Keeping the Objectives Separate 

Since selection does not depend upon how the members of the population 
are represented, the MOO techniques developed for other evolutionary al- 
gorithms can be easily adapted to GP. 



9.2 Keeping the Objectives Separate 
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Figure 9.1: Two-dimensional example of Pareto optimality and the Pareto 
front, where the goal is to maximise along both the x and y axes. Solutions 
A and B do not dominate each other. However, solution B is dominated by 
solution 2. (Adapted from (Langdon 1998).) 



The main idea in MOO is the notion of Pareto dominance. Given a set 
of objectives, a solution is said to Pareto dominate another if the first is not 
inferior to the second in all objectives, and, additionally, there is at least one 
objective where it is better. This notion can lead to a partial order , where 
there is no longer a strict linear ordering of solutions. In Figure |9.1| for 
example, individual A dominates (is better than) individual B along the y 
axis, but B dominates A along the x axis. Thus there is no simple ordering 
between then. The individual marked ‘2’, however dominates B on both 
axes and would thus be considered strictly better than B. 

In this case the goal of the search algorithm becomes the identification of 
a set of solutions which are non-dominated by any others. Ideally, one would 
want to find the Pareto front , i.e., the set of all non-dominated solutions in 
the search space. However, this is often unrealistic, as the size of the Pareto 
front is often limited only by the precision of the problem representation. If 
x and y in Figure [9TT| are real-valued, for example, and the Pareto front is 
a continuous curve, then it contains an infinite number of points, making a 
complete enumeration impossible. 



9.2.1 Multi-objective Bloat and Complexity Control 



Rodriguez-Vazquez, Fonseca, and Fleming ([1997) performed non-linear sys- 



tem identification using a MO GP system, where individuals were selected 
based on the Pareto dominance idea. The two objectives used were fitness 
and model complexity. In each generation individuals were ranked based on 
how many other individuals dominated them, and fitness was based on their 
rank. To better cover the Pareto front, niching via fitness sharing (Gold- 



berg 1989) was also performed. Preference information was also included 
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to focus the selection procedure towards specific regions of the Pareto front. 
Hinchliffe, Willis, and Tham (1998) applied similar ideas to evolve parsimo- 



nious and accurate models of chemical processes using MO GP. |Langdon| 
and Nordin (2000) applied Pareto tournaments to obtain compact solutions 



in programmatic image compression, two machine learning benchmark prob- 
lems and a consumer profiling task. [Nicolotti, Gillet, Fleming, and Green| 
(2002) used multi-objective GP to evolve quantitative structure-activity re- 



lationship models in chemistry; objectives included model fitting, the total 
number of terms and the occurrence of non-linear terms. 



Ekart and Nemeth ( 2001 ) tried to control bloat using a variant of Pareto 
tournament selection where an individual is selected if it is not dominated 
by a set of randomly chosen individuals. If the test fails, another individual 
is picked from the population, until one that is non-dominated is found. 
In order to prevent very small individuals from taking over the population 
in the early generations of runs, the Pareto criterion was modified so as 
to consider as non-dominated solutions also those that were only slightly 
bigger, provided their fitness was not worse. 





Bleuler, Brack, Thiele, and Zitzler 


(2001 ) suggested using the well-known 


multi-objective optimise!' SPEA2 ( 


Zitzler, Laumanns, and Thiele 


2001) to 


reduce bloat. 


de Jong, Watson, and Pollack 


(2001) and 


de Jong and Pollack 



(2003) proposed using a multi-objective approach to promote diversity and 
reduce bloat, stressing that without diversity enhancement (given by modern 
MOO methods) searches can easily converge to solutions that are too small 
to solve a problem. Tests with even parity and other problems were very 



encouraging. Badran and Rockett (2007) argued in favour of using mutation 



to prevent the population from collapsing onto single-node individuals when 
using a multi-objective GP. 

As well as directly fighting bloat, MO GP can also be used to simplify 
solution trees. After GP has found a suitable (but large) model, for example, 
one can continue the evolutionary process, changing the fitness function to 



include a second objective that the model be as small as possible (Langdon 



1998). GP can then trim the trees while ensuring that the simplified program 



still fits the training data. 



9.2.2 Other Objectives 

Although much of the use of MOO techniques in GP has been aimed at 
controlling bloat, there are also genuinely MOO applications. 



For example, Langdon (1998) made extensive use of Pareto dominance 
ranking to evolve different types of data structures. Up to six different 
criteria were used to indicate to what degree an evolved data structure met 
the requirements of the target data structure. The criteria were used in 
Pareto-type tournament selection, where, unlike in other systems, a second 
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round of comparisons with the rest of the population was used as a tie 
breaker. The method successfully evolved queues, lists, and circular lists. 



Langdon and Poli (1998b) used Pareto selection with two objectives, 



fitness and speed, to improve the performance of GP on the Santa Fe Trail 
Ant problem. Ross and Zhu (2004) used MO GP with different variants of 
Pareto selection to evolve 2-D textures. The objectives were feature tests 
that were used during fitness evaluation to rate how closely a candidate 
texture matched visual characteristics of a target texture image. Dimopoulos 
(2005) used MO GP to identify the Pareto set for a cell-formation problem 
related to the design of a cellular manufacturing production system. The 
objectives included the minimisation of total intercell part movement, and 
the minimisation of within-cell load variation. 



Rossi, Liberali, and Tettamanzi (2001) used MO GP in electronic design 



automation to evolve VHDL code. The objectives used were the suitability 
of the filter transfer function and the transition activity of digital blocks. 



Cordon, Herrera-Viedma, and Luque (20021 used Pareto-dominance-based 



GP to learn Boolean queries in information retrieval systems. They used two 
objectives: precision (the ratio between the relevant documents retrieved in 
response to a query and the total number of documents retrieved) and recall 
(the ratio between the relevant documents retrieved and the total number 
of documents relevant to the query in the database) . 



Barlow (2004) used a GP extension of the well-known NSGA-II MOO 



algorithm (Deb, Agrawal, Pratap, and Meyarivan 2000) for the evolution of 
autonomous navigation controllers for unmanned aerial vehicles. Their task 
was locating radar stations, and all work was done using simulators. Four 
objectives were used: the normalised distance from the emitter, the circling 
distance from the emitter, the stability of the flight, and the efficiency of 
the flight. 



Araujo ( 2006 ) used MO GP for the joint solution of the tasks of statistical 



parsing and tagging of natural language. Their results suggest that solving 
these tasks jointly led to better results than approaching them individually. 



Han, Zhou, and Wang (20061 used a MO GP approach for the identi- 



fication of chaotic systems where the objectives included chaotic invariants 
obtained by chaotic time series analysis as well, as the complexity and per- 
formance of the models. 



Khan (2006) used MO GP to evolve digital watermarking programs. The 



objectives were robustness in the decoding stage, and imperceptibility by the 
human visual system. Khan and Mirza (2007) added a third objective aimed 
at increasing the strength of the watermark in relation to attacks. 



Kotanchek, Smits, and Vladislavleva (2006) compared different flavours 



of Pareto-based GP systems in the symbolic regression of industrial data. 



Weise and Geihs (2006) used MO GP to evolve protocols in sensor networks. 



The goal was to identify one node on a network to act as a communication 
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relay. The following objectives were used: the number of nodes that know 
the designated node after a given amount of time, the size of the protocol 
code, its memory requirements, and a transmission count. 



Agapitos, Togelius, and Lucas (2007) used MO GP to encourage the 



effective use of state variables in the evolution of controllers for toy car 
racing. Three different objectives were used: the ratio of the number of 
variables used within a program to the number of variables offered for use by 
the primitive language, the ratio of the number of variables being set within 
the program to the number of variables being accessed, and the average 
positional distance between memory setting instructions and corresponding 
memory reading instructions. 

When two or three objectives need to be simultaneously optimised, the 
Pareto front produced by an algorithm is often easy to visualise. When 
more than three objectives are optimised, however, it becomes difficult to 
directly visualise the set of non-dominated solutions. Valdes and Bartoni 
(2006) proposed using GP to identify similarity mappings between high- 
dimensional Pareto fronts and 3-D space, and then use virtual reality to 
visualise the result. 



9.2.3 Non-Pareto Criteria 



Pareto dominance is not the only way to deal with multiple objectives with- 
out aggregating them into a scalar fitness function. 

Schmiedle, Drechsler, Grosse, and Drechsler (2001|) compared GP with 



four different MOO selection methods on the identification of binary deci- 
sion diagrams. Linear weighting of the objectives was compared against: a) 
Pareto dominance; b) a weaker form of Pareto dominance where a solution 
is preferred to another if the number of objectives where the first is superior 
to the second is bigger than the number of objectives where the opposite is 
true; c) lexicographic ordering (where objectives are ordered based on the 
user’s preference); and d) a new method based on priorities. The lexico- 
graphic parsimony pressure method proposed in (Luke and Panaitj 2002 



Ryan 1994|) is in fact a form of MOO with lexicographic ordering (in which 



shorter programs are preferred to longer ones whenever their fitness is the 
same or sufficiently similar). An approach which combines Pareto domi- 



nance and lexicographic ordering was proposed in (Panait and Luke 20041. 



9.3 Multiple Objectives via Dynamic and 
Staged Fitness Functions 

Often it is possible to rank multiple objectives based on some notion of 
importance. In these cases, it is possible to use dynamic fitness functions 
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which initially guide GP towards solutions that maximise the main objec- 
tive. When enough of the population has reached reasonable levels in that 
objective, the fitness function is modified so as to guide the population to- 
wards the optimisation of a second objective. In principle this process can 
be iterated for multiple objectives. Of course, care needs to be taken to 
ensure that the functionality reached with a set of previous fitness measures 
is not wiped by the search for the optima of a later fitness function. This 
can be avoided by making sure each new fitness function somehow includes 
all the previous ones. For example, the fitness based on the new objectives 
can be added to the pre-existing objectives with some appropriate scaling 
factors. 

A similar effect can be achieved via static , but staged, fitness functions. 
These are staged in the sense that certain levels of fitness are only be made 
available to an individual once it has reached a minimum acceptable perfor- 
mance on all objectives at the previous level. If each level represents one of 
the objectives, individuals are then encouraged to evolve in directions that 
ensure that good performance is achieved and retained on all objectives. 



Koza et al. ( 1999 ) used this strategy when using GP for the evolution of 



electronic circuits where many criteria, such as input-output performance, 
power consumption, size, etc., must all be taken into account to produce 
good circuits. Kalganova and Miller (19991 used Cartesian GP (see Sec- 
tion 7.2.31 to design combinational logic circuits. A circuit’s fitness was 
given by a value between 0 and 100 representing the percentage of output 
bits that were correct. If the circuit was 100% functional, then a further 
component was added which represented the number of gates in the graph 
that were not involved in the circuit. Since all individuals had the same 
number of gates available in the Cartesian GP grid, this could be used to 
minimise the number of gates actually used to solve the problem at hand. 



9.4 MO GP via Operator Bias 

While it is very common to use only modifications of the selection phase to 
perform multi-objective search, it is also possible to combine MOO selection 
with genetic operators exhibiting an inbuilt search bias which can steer the 
algorithm towards optimising certain objectives. 

In some sense the classical repair operators, which are used in constrained 
optimisation to deal with hard constraints, are an extreme form of the idea 
of using operators to help MOO searc i,0 More generally, it is possible to 
imagine search operators with softer biases which favour the achievement 
of one or more objectives. These can be the same or different from the 
objectives that bias the selection of parents. 

In combinatorial optimisation, repair operators are applied to invalid offspring to 
modify them in such a way as to ensure a problem’s hard constraints are respected. 
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The pygmies and civil servants approach proposed in (Ryan 1994 1996) 
combines the separation typical of Pareto-based approaches with biased 
search operators. In this system two lists are built, one where individu- 
als are ranked based on fitness and the other where individuals are ranked 
based on a linear combination of fitness and size (i.e., a parsimonious fit- 
ness function). During crossover, the algorithm draws one parent from the 
first list and the other from the second list. This can be seen as a form of 
disassortative mating aimed at maintain diversity in the population. An- 



other example of this kind is (Zhang and Rockett 2005) where crossover 



was modified so that an offspring is retained only if it dominates either of 
its parents. 

Furthermore, as discussed in Sections |5.2| and |11.3.2| there are several 
mutation operators with a direct or indirect bias towards smaller programs. 
This provides a pressure towards the evolution of more parsimonious solu- 
tions throughout a run. 

As with the staged fitness functions discussed in the previous section, 
it is also possible to activate operators with a known bias towards smaller 
programs only when the main objective — say a 100% correct solution — has 
been achieved. This was tested in (Pujol 1999 Pujol and Poli 1997), where 
GP was used to evolve neural networks. After a 100% correct solution was 
found, one hidden node of each network in the population was replaced by 
a terminal, and the evolution process was resumed. This pruning procedure 
was repeated until the specified number of generations had been reached. 



Chapter 10 

Fast and Distributed 
Genetic Programming 



Users of all artificial intelligence tools are always eager to extend the bound- 
aries of their techniques, for example by attacking more and more difficult 
problems. In fact, to solve hard problems it may be necessary to push GP 
to the limit — populations of millions of programs and/or long runs may be 
necessary. 

There are a number of techniques to speed up, parallelise and distribute 
GP search. We start by looking at ways to reduce the number of fitness 
evaluations or increase their effectiveness (Section 10.1 1 and ways to speed 
up their execution (Section 10.21. We then look at the idea of running GP 
in parallel (Section 10. 3| ) and point out that faster evaluation is not the 
only reason for doing so, as geographic distribution has advantages in its 
own right. In Section |l0.4| we describe master-slave parallel architectures 
(Section 10.4.1| , running GP on graphics hardware (Section |10.4.2 I and 
FPGAs (Section 10.4.3). Section 10.4.4 describes a fast method to exploit 
the parallelism available on every computer. Finally, Section [103] concludes 
this chapter with a brief discussion of distributed, even global, evolution of 
programs. 



10.1 Reducing Fitness Evaluations and/or 
Increasing their Effectiveness 

While admirers of linear GP will suggest that machine code GP is the ul- 
timate in speed, all forms of GP can be made faster in a number of ways. 
The first is to reduce the number of times a program is evaluated. 

Many applications find the fitness of programs by running them on mul- 
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tiple training examples. The use of many examples provides an accurate 
evaluation of a program’s quality. However, ultimately the point of fitness 
evaluation is to make a binary decision — does this individual get a child or 
not? The overwhelming proportion of GP’s computational effort (or indeed 
the effort in any evolutionary computation technique) goes into adjusting 
the probability of this binary decision. However, it is not clear that a high- 
precision fitness evaluation is always necessary to decide well. Indeed, even 
when the fitness evaluation is very accurate, most selection algorithms Q be- 
ing stochastic, inject noise into the decision of which points in the search 
space to proceed from and which to abandon. In these cases, reducing the 
number of times a program is evaluated is effectively an additional source 
of noise. If a program has already demonstrated it works poorly compared 
to the rest of the population on a fraction of the available training data, it 
not likely to be chosen as a parent. Conversely, if it has already exceeded 
many programs in the population after being tested on only a fraction of 



the training set, then it is likely to be chosen as a parent (Langdon 1998). 



In either case, it is apparent that we do not gain much by running it on 
the remaining training examples. Teller and Andre (1997) developed these 
ideas into a useful algorithm called the rational allocation of trials. 

As well as the computational cost, there are other negatives consequences 
that come from using all the training data all the time, as doing so gives rise 
to a static fitness function. In certain circumstances this may encourage the 
population to evolve into a cul-de-sac where it is dominated by offspring of a 
single initial program which did well on some fraction of the training cases, 
but was unable to fit others. A static fitness function can create conditions 
where good programs that perform moderately well on most portions of the 
training data have lower fitness than those that do very well in only a few 
small regions. With high selection pressure, it takes surprisingly little time 
for the best individual to dominate the whole population]^] 

Gathercole and Ross ( 1994[ ]I997|) investigated a number of ways of dy- 



namically changing training samples]^] yielding a number of interacting ef- 
fects. Firstly, by using only a subset of the available data, the GP fitness 
evaluation took less time. Secondly, by changing which examples were being 



( Baker 



1 Common selection algorithms include roulette wheel selection ( | Goldberg! 1989] ) , SUS 



and tournament selection. 



■— *t! 


lis Is called the take over time i 


Goldberg| 


1989]). This can be formally analysed 


(Blicklel 1996; Droste, Jansen, Rudolp 


i, Schwefe 


, Tinnefeld, and Wegener| 2003}, but 



for tournament selection, a simple rule of thumb is often sufficient. If T is the tour- 
nament size, roughly logy (Pop size) generations are needed for the whole population to 
become descendants of a single individual. If, for example, we use binary tournaments 
(T = 2), then “take over” will require about ten generations for a population of 1,024. 
Alternatively, if we have a population of one million (10 6 ) and use ten individuals in each 
tournament (T = 10), then after about six generations more or less everyone will have 
the same greats g reats great 4 great 3 grand 2 motheri. 



^Siegel (1994) proposed a rather different implementation. 
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used over time, the evolving population saw more of the training data and so 
was less liable to over fit a fraction of them. Thirdly, by randomly changing 
the fitness function, it became more difficult for evolution to produce an 
overspecialised individual which took over the population at the expense of 
solutions which were viable on other parts of the training data. Dynamic 
subset selection (DSS) appears to have been the most successful of Gather- 
cole’s suggested algorithms. It has been incorporated into Discipulus (see 
page 63 1, and was recently used in a large data mining application (Curry, 



Lichodzijewski, and Heywood 2007). 



Where each fitness evaluation may take a long time, it may be attrac- 
tive to interrupt a long-running program in order to let others run. In GP 



systems which allow recursion or contain iterative elements (Brave 11996 



Langdon 


1998 


Wilson and Heywood 


2007 


Wong and Leung 



1996) it is 



common to enforce a time limit, a limit on the number of instructions ex- 
ecuted, or a bound on the number of times a loop is executed. 



Maxwell 



( 1994 ) proposed a solution to the question of what fitness to give to a pro- 
gram that has been interrupted. He allowed each program in the population 
a quantum of CPU time. When the program used up its quantum it was 
check-pointedj^] In Maxwell’s system, programs gained fitness as they ran, 
i.e., each time a program correctly processed a fitness case, its fitness was 
incremented. Tournament selection was then performed. If all members of 
the tournament had used the same number of CPU quanta, then the fitter 
program was the winner. If, however, one program had used less CPU than 
the others (and had a lower fitness) then it was restarted and run until it 
had used as much CPU as the others. Then fitnesses were compared in the 
normal way. 

Teller ( 1994 1 had a similar but slightly simpler approach: every indi- 



vidual in the population was run for the same amount of time. When the 
allotted time elapsed a program was aborted and an answer extracted from 
it, regardless of whether it had terminated or not. Teller called this an “any 
time” approach. This suits graph systems like Teller’s PADO (Section |7.2.2| ) 
or linear GP (Chapter 7.1 1 where it is easy to designate a register as the 
output register. The answer can then be extracted from this register or from 
an indexed memory cell at any point (including whilst the programming is 
running). Other any time approaches include (Spector and Alpern 19951 



and (Langdon and Poli 2008). 

A simple technique to speed up the evaluation of complex fitness func- 
tions is to organise the fitness function into stages of progressively increasing 
computational cost. Individuals are evaluated stage by stage. Each stage 
contributes to the overall fitness of a program. However, individuals need 



4 When a program is check-pointed, sufficient information (principally the program 
counter and stack) is saved so that it can later be restarted from where it was stopped. 
Many multi-tasking operating systems do something similar. 



86 



10 Fast and Distributed Genetic Programming 



to reach a minimum fitness value in each stage in order for them to be 
allowed to progress to the next stage and acquire further fitness. Often 
different stages represent different requirements and constraints imposed on 
solutions. 

Recently, a sophisticated technique called backward chaining GP has 

” 2005a|b 



been proposed (Poli 2005 Poli and Langdon 



2006a). In GP and 



other evolutionary algorithms which use tournament selection with small 
tournament sizes, backward chaining can radically reduce the number of 
fitness evaluations. Tournament selection randomly draws programs from 
the population to construct tournaments, the winners of which are then se- 
lected. Although this process is repeated many times in each generation, 
when the tournaments are small there is a significant probability that an 
individual in the current generation is never chosen to become a member 
of any tournament. By reordering the way operations are performed, back- 
ward chaining GP exploits this. It not only avoids fitness calculations for 
individuals that are never included in a tournament, but can also achieve 
higher fitness sooner. 



10.2 Reducing Cost of Fitness with Caches 



In computer hardware it is common to use data caches which automatically 
hold copies of data locally in order to avoid the delays associated with fetch- 
ing them from disk or over a network every time they are needed. This can 
work well when a small amount of data is needed many times over a short 
interval. 

Caches can also be used to store results of calculations, thereby avoiding 



the re-calculation of data (1 


dandley 


1 


994). 


amounts of common code 


(Langdon 


1998 



Langdon and Banzhaf 2005 



Langdon and Poli 2008 b This is, after all, how genetic search works: it 
promotes the genetic material of fit individuals. So, typically in each gener- 
ation we see many copies of successful code. 

In many (but by no means all) GP systems, subtrees have no side-effects. 
This means results pass through a program’s root node in a well organised 
and easy to understand fashion. Thus, if we remember a subtree’s inputs 
and output when it was run before, we can avoid re-executing code whenever 
we are required to run the subtree again. Note that this is true irrespective 
of whether we need to run the same subtree inside a different individual or 
at a different time (i.e., a later generation). Thus, if we stored the output 
with the root node, we would only need to run the subtree once for any 
given set of inputs. Whenever the interpreter comes to evaluate the subtree, 
it needs only to check if the subtree’s root contains a cache of the values the 
interpreter calculated last time, thus saving considerable computation time. 

In order to achieve this, however, we need to overcome a problem: not 
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only must the answer be stored, but the interpreter needs to know that the 
subtree’s inputs are the same too. The common practices of GP come to our 
aid here. Usually every tree in the population is run on exactly the same 
inputs for each of the fitness cases. Thus, for a cache to work, the interpreter 
does not need to know a tree’s inputs in detail, it need only know which of 
the fixed set of test cases was used. 

A simple means of implementing this type of cache is to store a vector of 
values returned by each subtree for each of the test cases. Whenever a sub- 
tree is created (i.e., in the initial generation, by crossover or by mutations) 
the interpreter is run and the cache of values for its root node is set. Note 
this is recursive, so caches can also be calculated for subtrees within it at 
the same time. Now, when the interpreter is run and comes to a subtree’s 
root node, it will simply retrieve the value it calculated earlier, using the 
test case’s number as an index into the cache vector. 

If a subtree is created by mutation, then its cache of values will be 
initially empty and will have to be calculated. However, this costs no more 
than it would without caches. 



When code is inserted into an existing tree, be it by mutation or 
crossover, the chance that the new code behaves identically to the old code 
is normally very small. This means that the caches of every node between 
the new code and the root node may be invalid. The simplest solution is 
to re-evaluate them all. This may sound expensive, but the caches in all 
the other parts of the individual remain valid and can be used when the 
cache above them is re-evaluated. Thus, in effect, if the crossed over code is 
inserted at depth d , only d nodes need to be evaluated. 



The whole question of monitoring how effective individual caches are, 
what their hit-rates are, etc. has been little explored. In practice, impressive 
savings have been achieved by simple implementations, with little monitor- 



ing and rudimentary garbage collection. Recent analysis (Ciesielski and Li 


2004 


Dignum and Poli 


2007 


Langdon and Poli 


2002 |Poli et al. 


2007) 



has shown that GP trees tend not to have symmetric shapes, and many 
leaves are very close to the root. This provides a theoretical explanation for 
why considerable computational saving can be made by using fitness caches. 
While it is possible to use hashing schemes to efficiently find common code, 
in practice assuming that common code only arises because it was inherited 
from the same location (e.g., by crossing over) is sufficient. 

As well as the original Directed acyclic graph (DAG) implementation 
(Handley 19941 other work includes (Ciesielski and Li 2004| [Keijzer 1996 



McPhee, Hopper, and Reierson 1998 Yangiya 1 9 9 5 [ ) . While so far we have 
only considered programs where no side effects take place, there are cases 
where caching can be extended outside this domain. For example, [Langdon 
(1998) used fitness caches in evolved trees with side effects by exploiting 



syntax rules about where in the code the side-effects could lie. 
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10.3 Parallel and Distributed GP are Not 
Equivalent 

There are two important aspects of parallel evolutionary algorithms which 
are equally important but are often confused. The first is the traditional 
aspect of parallel computing. That is, we port an existing algorithm onto 
a supercomputer so that it runs faster. The second aspect comes from the 
biological inspiration for evolutionary computation. 

In nature everything happens in parallel. Individuals succeed or fail in 
producing and raising children at the same time as other members of their 
species. These individuals are spread across oceans, lakes, rivers, plains, 
forests, mountain chains, etc. It was this geographic spread that led | Wright] 
(1932) to propose that geography and changes to it are of great importance 



to the formation of new species and, so, to natural evolution as a whole. 

Suppose a species occupies a range of hills. Individuals need not be able 
to move from one end of the range to another in their lifetime, but their 
descendents might. Wright (1932) proposed a mathematical model that can 



predict the amount of mixing between descendents across the entire range 
is needed to keep the whole population together as a single species. Based 
on his model, he predicted that only a few migrants per generation between 
hill tops are sufficient. 

Now suppose the sea level rises. What was once a continuous range 
of hills becomes a chain of islands. Suppose members of this species have 
limited ability to swim. If the islands are close and the ocean currents are 
sometimes favourable, it may be that every year a few individuals cross 
between neighbouring islands. This may be enough to constrain diversifi- 
cation and allow the population to remain a single species. However, if the 
gaps between island become larger, the chance of an individual occasionally 
crossing the sea and breeding becomes remote. On each island, then, the 
sub-populations begin to diverge and over time new species, specific to each 



island, are formed (Darwin, 1859). 



In nature, changes in conditions across regions can lead to correspond- 
ing differences in spatially distributed populations. Sometimes this can lead 
to new species, as in the example above. In other cases the variation can 
be gradual enough that there is no clear delineation that could be called a 
species boundary, but geographically distant individuals are unable or un- 
willing to mate, fulfilling a key property of different species. A particularly 
dramatic example of this is a ring species. The Larus gulls, for example, live 
along a ring that roughly follows the Arctic Circle. With one exception the 
variants can interbreed all along its range, despite often having differences 
significant enough that they have received different names. The key excep- 
tion is in Europe, where the “ends” of the range meet. There the Herring 
Gull ( Larus argentatus) and the Lesser Black-backed Gull ( Larus fuscus ) 
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intermingle, but rarely interbreed. 

The topology of the landscape is often a strong determiner of the spatial 
structure of a population. A large, fairly homogenous region, for example, 
might give rise to a species being spatially differentiated in two dimensions 
due to distance and climate. A river, on the other hand, may give rise to 
a linear distribution, especially if there are structures like water falls that 
restrict migration, and a river basin with several tributaries could lead to a 
tree structure. 

In evolutionary computation we can choose whether we want to model 
some form of geography. We can run GP on parallel hardware so as to speed 
up runs, but without introducing a notion of proximity that limits which 
individuals are allowed to mate. Alternatively, we can model some form of 
geography, introducing spatial structure as a result. 

In the following two sections we will discuss both ideas. It is important 
to note, however, that one does not need to use parallel hardware to use ge- 
ographically distributed GP populations. Although parallel hardware natu- 
rally lends itself to the implementation of physically-distributed populations, 
one can obtain similar benefits by using logically- distributed populations in 
a single machine. 



10.4 Running GP on Parallel Hardware 



In contrast to much of computer science, evolutionary computation can be 
readily run on parallel computer hardware; indeed it is “embarrassingly par- 
allel 



Andre and Koza 


1998 


). For example, when 


Openshaw and Turton 



(1994) ran GP on a Cray supercomputer they obtained about 30% of its 
theoretical peak performance, embarrassing their supercomputer savvy col- 
leagues who rarely got better than a few percent out of it. 

In Sections |10.4.1| - [T0.4.3 we look at three ways of running GP on par- 
allel hardware. Section 10.4.4 shows how to get 32 parallel operations from 
standard hardware. 



10.4.1 Master— slave GP 

If the objective is purely to speed up runs, we may want our parallel GP to 
work exactly the same as it did on a single computer. This is possible, but 
to achieve it we have to be very careful to ensure that, even if some parts of 
the population are evaluated more quickly, parallelisation does not change 
how we apply selection and which GP individual crosses over with which. 
Probably the easiest way to implement this is the master-slave model. 



In the master-slave model (Oussaidene, Chopard, Pictet, and Tomassini 



1997) breeding, selection crossover, mutation etc. occur just as they would 



on a single computer and only fitness evaluation is spread across a network 
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of computers. Each GP individual and its fitness cases are sent across the 
network to a different compute node. The central node waits for the compute 
nodes to return their individuals’ fitnesses. Since individuals and fitness 
values are typically stored in small data structures, this can be quite efficient 
since transmission overheads are limited. 

The central node is an obvious bottleneck. Also, a slow compute node 
or a lengthy fitness case will slow down the whole GP population, since 
eventually its result will be needed before moving onto the next generation. 



10.4.2 GP Running on GPUs 



Modern PC graphics cards contain powerful graphics processing units 
(GPUs) including a large number of computing components. For exam- 
ple, it is not atypical to have 128 streaming processors on a single PC’s 
graphics card. In the last few years there has been an explosion of interest 
in porting scientific or general purpose computation to mass market graphics 



2007). 



cards (Owens, Luebke, Govindaraju, Harris, Kruger, Lefohn, and Purcell 



Indeed, the principal manufactures (nVidia and ATI) claim faster than 



Moore’s Law (Moore 19651 increase in performance, suggesting that GPU 
floating point performance will continue to double every twelve months, 
rather than the 18-24 months observed for electronic circuits in general and 
personal computer CPUs in particular. In fact, the apparent failure of PC 
CPUs to keep up with Moore’s law in the last few years makes GPU comput- 
ing even more attractive. Even today’s bottom-of-the-range GPUs greatly 
exceed the floating point performance of their hosts’ CPU. However, this 
speed comes at a price, since GPUs provide a restricted type of parallel pro- 
cessing, often referred to a single instruction multiple data (SIMD) or single 
program multiple data (SPMD). Each of the many processors simultaneously 
runs the same program on different data items. 

There have been a few genetic programming experiments with GPUs 



(Chitty 2007 Ebner, Reinhardt, and Albert 



2005 



Harding and Banzhaf 



2007; Langdon and Banzhaf 


2008 [Langdon and Harrison 


2008| Loviscach 


and Meyer-Spradow, 2003] 


Meyer-Spradow and Loviscach) 2003) 


Reggia, 


Tagamets, Contreras-Vidal, Jacobs, Weems, Naqvi, Winder, Chabu 


<, Jung, 



and Yang 2006). So far, in GP, GPUs have just been used for fitness eval- 



uation. 



Harding and Banzhaf (2007]) used the Microsoft research GPU develop- 
ment DirectX iM tools to compile (using a technique originally developed by 



Harris and Buxton ( 1996)) a whole population of Cartesian GP network pro- 



grams into a single GPU program which was loaded onto a laptop’s GPU to 



run the fitness cases. Chitty (20071 used a conversion technique, somewhat 
like an interpreter, to automatically convert each GP tree into a program 
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Figure 10.1: nVidia 8800 Block diagram. The 128 1360 MHz Stream Processors are arranged in 16 blocks of 8. Blocks 
share 16 KB memory (not shown), an 8/1 KB LI cache, 4 Texture Address units and 8 Texture Filters. The 6x64 bit 
bus (dashed) links off chip RAM at 900 MHz. (Since there are two chips for each of the six off-chip memory banks, the 
bus is effectively running at up to 1800 Mhz per bank.) There are 6 Raster Operation Partitions. (nVidia, 2007). 
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that could be compiled for the GPU on the host PC. The compiled pro- 
grams were transferred one at a time to a GPU for fitness evaluation. Both 
groups obtained impressive speedups by running many test cases in parallel. 





Langdon and Banzhaf 


to 

o 

o 

00 


and 


_/angdon and Harrison ( 


2008) created 


a SIMD interpreter ( 


Juille and Pollack 


1996 


) using RapidMind’s GNU C++ 



OpenGL framework to simultaneously run up to a quarter of a million GP 
trees on an NVIDIA GPU (see Figure 10.1 1 As discussed in Section 7.1.2 



GP trees can be linearised. This avoids pointers and yields a very compact 
data structure; reducing the amount of memory needed in turn facilitates 
the use of large populations. To avoid recursive calls in the interpreter, 
Langdon used reverse polish notation (RPN), i.e., a post-fix rather than 
a pre-fix notation. Only small modifications are needed to crossover and 
mutation so that they act directly on the RPN expressions. This means the 
same representation is used on both the host and the GPU. Almost a billion 
GP primitives can be interpreted by a single graphics card per second. In 
both Cartesian and tree-based GP the genetic operations are done by the 
host CPU. Wong, Wong, and Fok (20051 showed, for a genetic algorithm, 
these too can be done by the GPU. 

Although each of the GPU’s processors may be individually quite fast 
and the manufacturers claim huge aggregate FLOPS ratings, the GPUs are 
optimised for graphics work. In practice, it is hard to keep all the processors 
fully loaded. Nevertheless 30 GFLOPS has been achieved (Langdon and 



Harrison 20081. Given the differences in CPU and GPU architectures and 



clock speeds, often the speedup from using a GPU rather than the host 
CPU is the most useful statistic. This is obviously determined by many 
factors, including the relative importance of amount of computation and 



size of data. The measured RPN tree speedups were 7.6-fold (Langdon and 



Harrison 2008) and 12.6-folcl (Langdon and Banzhaf 2008). 



10.4.3 GP on FPGAs 

Field programmable gate arrays (FPGAs) are chips which contain large ar- 
rays of simple logic processing units whose functionality and connectivity 
can be changed via software in microseconds by simply writing a configu- 
ration into a static memory. Once an FPGA is configured it can update 
all of its thousands of logic elements in parallel at the clock speed of the 
circuit. Although an FPGA’s clock speed is often an order of magnitude 
slower than that of a modern CPU, its massive parallelism makes it a very 
powerful computational device. Because of this and of their flexibility there 
has been significant interest in using FPGAs in GP. 

Work has ranged from the use of FPGAs to speed up fitness evaluation 



5 Bigger populations, e.g. five million programs jLangdon and Harrison] |2008| , are 
possible by loading them onto the GPU in 256k units. 
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(Koza, Bennett, Hutchings, Bade, Keane, and Andre 1997 Seok, Lee, and 



Zhang 20001 to the definition of specialised operators (Martin and Poli 



2002). It is even possible to implement a complete GP on FPGAs, as sug- 



gested in ( 


Heywood and Zincir-Heywood 2000 


Martin 


2001 


2002 


Sidlru, 


Mei, and Prasanna 


1998). A massively parallel C 


IP implementation has also 
to date all tests with that 


been proposed by Eklund 


(2001 


2004) although 



architecture have only been performed in simulation. 



10.4.4 Sub-machine-code GP 

We are nowadays so used to writing programs using high level sequential 
languages that it is very easy to forget that, underneath, computers have a 
high degree of parallelism. Internally, CPUs are made up of bit-slices which 
make it possible for the CPU to process all of the bits of the operands of an 
instruction in one go, in a single clock tick. 



Sub-machine- code GP (SMCGP) (Poli and Langdon 1999) is a technique 



to speed up GP and to extend its scope by exploiting the internal parallelism 
of sequential CPUs. In Boolean classification problems, SMCGP allows the 
parallel evaluation of 32 or 64 (depending on the CPU’s word size) fitness 
cases per program execution, thereby providing a significant speed-up. This 
has made it possible to solve parity problems with up to 4 million fitness 
cases (Poli and Page! 2000). SMCGP has also been applied with success 



in binary image classification problems (Adorni, Cagnoni, and Mordonini 



2002 Quintana, Poli, and Claridge 2003). The technique has also been 



extended to process multiple fitness cases per program execution in continu- 
ous symbolic regression problems where inputs and outputs are real-valued 



numbers (Poli 1999b). 



10.5 Geographically Distributed GP 

Unless some type of synchronisation is imposed, the parallel forms of GP 
in which different parts of a population are evolved by different processing 
elements will not be running the same algorithm as the standard single- 
CPU version of GP. Therefore, almost certainly, different parallelisations 
will produce different answers. However, as we discussed in Section |10.3| 
this is not necessarily a bad thing. 

Parallelisation itself can bring benefits similar to those hypothesised in 



natural populations by Wright (1932). In particular, the population is of- 



ten divided into semi-independent sub-populations called demes (Collins 



1992[|D’haeseleer and Bluming, 1994; Langdon[|1998| [Popovici and De Jong 



2006). The flow of genetic material between demes is restricted by limiting 
the exchange of individuals between them. The limit can be on the number 
of individuals that are allowed to migrate per generation. Alternatively the 



94 



10 Fast and Distributed Genetic Programming 




(a) 




(b) 



Figure 10.2: Spatially structured GP populations, (a) Toroidal grid of 
demes, where each deme (a node) contains a sub-population, and demes 
periodically exchange a small group of high-fitness individuals using a grid 
of communication channels, (b) Fine-grained distributed GP, where each 
grid cell contains one individual and where the selection of a mating partner 
for the individual in the centre cell is performed by executing a tournament 
among randomly selected individuals (e.g., the individuals shaded) in its 
3x3 neighbourhood. 



demes may be considered to be arranged in a “geographical” topology that 
constrains which demes can trade individuals. For example, it may be that 
with limited migration between compute nodes, the evolved populations on 
adjacent nodes will diverge, and that this increased diversity may lead to 
better solutions. Fernandez, Tomassini, and Vanneschi (2003), for exam- 



ple, report that distributing individuals between subpopulations offers an 
advantage in terms of quality of solutions and computational effort. 



When Koza first started using GP on a network of Transputers (Andre 



and Koza 1996), Andre experimentally determined the best migration rate 



for their problem. He suggested Transputers arranged in an asynchronous 
2-D toroidal square grid (such as the one in Figure 10.2 r) should exchange 
2% of their population with their four neighbours. 

Densely connected grids have been widely adopted in parallel GP. Usu- 
ally they allow innovative partial solutions to spread quickly. However, the 
GA community reported better results from less connected topologies, such 
as arranging the compute nodes’ populations in a ring, so that they could 



transfer genes only between themselves and their two neighbours (Stender 



1993). Potter (1997) argues in favour of spatial separation in populations 



and fine-grained distributed forms of GP (see Figure 10.2 a). Whitley (2001) 
gives some guidance on parallel genetic algorithms. 




Figure 10.3: A globally distributed GP system (Langdon 2005a). The 
server is the centre of the star architecture, with the lines connecting it 
to users around the world. The users evolved snowflake patterns using a 
continuously evolving L-System, and their (subjective) preferences provided 
the fitness measure used to drive the system. 



While many have looked enviously at Koza’s 1000 node Beowulf cluster 



(Sterling 1998) and other supercomputer realisations of GP (Bennett, Koza, 



Shipman, and Stiffelman 1999 Juille and Pollack 1996), a supercomputer is 



often not necessary. Many businesses and research centres leave computers 
permanently switched on. During the night their computational resources 
tend to be wasted. This computing power can easily and efficiently be 
used to execute distributed GP runs overnight. Typically, GP does not 
demand a high performance bus to interconnect the compute nodes, and 
so existing office Ethernet networks are often sufficient. While parallel GP 



systems can be implemented using MPI (Walker 20011 or PVM (Fernandez, 



Sanchez, Tomassini, and Gomez 19991, the use of such tools is not necessary: 



simple Unix commands and port-to-port HTTP is sufficient (Poli, Page, 



and Langdon 1999). The population can be split and stored on modest 



computers. With only infrequent interchange of parts of the population 
or fitness values little bandwidth is needed. Indeed a global population 



spread via the Internet ( 


Chong and Langdon 


1999 


Draves 2006 


Klein and 


Spector 


2007 


Langdon 


2005a 


1 , a la setiOhome, is perfectly feasible (see 


Figure ' 


0.3 


1. 





Other parallel GPs include ( 


Cheang, Leung, and Lee 


2006 


Folino, Piz- 


zuti, and Spezzano 2003; Gustafson and Burke 


2006 I 


Olein and Spector 


2007 Tanev, Uozumi, and Akhmetov 2004). 



Chapter 11 

GP Theory and its 
Applications 



Most of this book is about the mechanics of GP and its practical use for 
solving problems. In fact, as will become clear in Chapter [12] GP has 
been remarkably successful as a problem-solving and engineering tool. One 
might wonder how this is possible, given that GP is a non-deterministic 
algorithm, and as a result its behaviour varies from run to run. It is also a 
complex adaptive system which sometimes shows intricate and unexpected 
behaviours (such as bloat). Thus it is only natural to be interested in GP 
from the scientific point of view. That is, we want to understand why can 
GP solve problems, how it does it, what goes wrong when it cannot, what are 
the reasons for certain undesirable behaviours, what can we do to get rid of 
them without introducing new (and perhaps even less desirable) problems, 
and so on. 

GP is a search technique that explores the space of computer programs. 
The search for solutions to a problem starts from a group of points (random 
programs) in this search space. Those points that are above average quality 
are then used to generate a new generation of points through crossover, 
mutation, reproduction and possibly other genetic operations. This process 
is repeated over and over again until a stopping criterion is satisfied. If we 
could visualise this search, we would often find that initially the population 
looks like a cloud of randomly scattered points, but that, generation after 
generation, this cloud changes shape and moves in the search space. Because 
GP is a stochastic search technique, in different runs we would observe 
different trajectories. If we could see regularities, these might provide us 
with a deep understanding of how the algorithm is searching the program 
space for the solutions, and perhaps help us see why GP is successful in 
finding solutions in certain runs and unsuccessful in others. Unfortunately, 
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it is normally impossible to exactly visualise the program search space due 
to its high dimensionality and complexity, making it that much harder to 
understand. 

An alternative approach to better understanding the dynamics of GP is 
to study mathematical models of evolutionary search. There are a number 
of cases where this approach has been very successful in illuminating some 
of the fundamental processes and biases in GP systems. In this chapter we 
will review several theoretical approaches to understanding GP, including 
mathematical models of GP (Section 11.1 1, analyses of the structure of GP 
search spaces (Section [ll.2[ ), and the use of theory to understand and combat 
the chronic problem of bloat in a principled fashion (Section 11.31. 



11.1 Mathematical Models 

Schema theories are among the oldest and the best known models of evolu- 
tionary algorithms ( Holland] 1992 Whitley 19941. Schema theories are 



based on the idea of partitioning the search space into subsets, called 
schemata. They are concerned with modelling and explaining the dynamics 
of the distribution of the population over the schemata. Modern genetic 



algorithm schema theory (Stephens and Waelbroeck 1997 1999) provides 



exact information about the distribution of the population at the next gen- 
eration in terms of quantities measured at the current generation, without 
having to actually run the algorithm. 

The theory of schemata in GP has had a difficult childhood. Some excel- 



lent early efforts led to different worst-case-scenario schema theorems (Al- 



tenberg, 1994J 


Koza, 1992; O’Reilly and Oppacher|| 1994b; Poli and Langdon, 


1997 


Rosea 1997 


Whigham 


1995 


). Only very recently have the first ex- 



formulations (rather than lower bounds) for the expected number of individ- 
uals sampling a schema at the next generation. Initially (Poli 2000b. 2001a|), 



these exact theories were only applicable to GP with one-point crossover (see 



Section 5.3 ). However, more recently they have been extended to the class of 



homologous crossovers (Poli, McPhee, and Rowe 20041 and to virtually all 
types of crossovers that swap subtrees (Poli and McPhee 2003a|b I, including 
standard GP crossover with and without uniform selection of the crossover 



points (Section 2.4 1, one-point crossover, context-preserving crossover and 
size-fair crossover (which have been described in Section 5.3 1, as well as 
more constrained forms of crossover such as strongly-typed GP crossover 
(see Section 6.2.2| ), and many others. 

Other models of evolutionary algorithms include models based on 
Markov chain theory (e.g. (Davis and Principe 1993 jNix and Vose 1992[ ) ) 
and on statistical mechanics (e.g. (Priigel-Bennett and Shapiro 1994)). 



Markov models have been applied to GP (Mitavskiy and Rowe 2006 Poli 



et al. 2004 Poli, Rowe, and McPhee 2001), but so far they have not been 
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developed as fully as the schema theory model. 

Exact mathematical models of GP are probabilistic descriptions of the 
operations of selection, reproduction, crossover and mutation. They explic- 
itly represent how these operations determine which areas of the program 
space will be sampled by GP, and with what probability. These models treat 
the fitness function as a black box, however. That is, there is no represen- 
tation of the fact that in GP, unlike in other evolutionary techniques, the 
fitness function involves the execution of computer programs on a variety of 
inputs. In other words, schema theories and Markov chains do not tell us 
how fitness is distributed in the search space. Yet, without this information, 
we have no way of closing the loop and fully characterising the behaviour of a 
GP systems which is always the result of the interaction between the fitness 
function and the search biases of the representation and genetic operations 
used in the system. 



11.2 Search Spaces 



The characterisation of the space of computer programs explored by GP has 



been another main topic of theoretical research (Langdon and Poli 2002). 



Of course results describing the space of all possible programs are widely 
applicable, not only to GP and other search-based automatic programming 
techniques, but also to many other areas ranging from software engineering 
to theoretical computer science. 

In this category are theoretical results showing that the distribution of 
functionality of non Turing-complete programs approaches a limit as pro- 
gram length increases. That is, although the number of programs of a 
particular length grows exponentially with length, beyond a certain thresh- 
old the fraction of programs implementing any particular functionality is 
effectively constant. For example, in Figure [T 1 . 1| we plot the proportion of 
binary program trees composed of N AND gates which implement each of the 
2 2 = 256 Boolean functions of three inputs. Notice how, as the length of 
programs increases, the proportion of programs implementing each function 
approaches a limit. 

This does not happen by accident. There is a very substantial body of 
empirical evidence indicating that this happens in a variety of other systems. 
In fact, there are also mathematical proofs of these convergence results for 
two important forms of programs: Lisp (tree-like) S-expressions (without 
side effects) and machine code programs without loops (Langdon 2002a b, 
2003a|b[ 2005b Langdon and Poli 2002| ). That the limiting distribution of 
functionality reaches a limit as program length increases was also proven for 
a variety of other non- Turing complete computers and languages, including: 
a) cyclic (increment, decrement and NOP), b) bit flip computer (flip bit 
and NOP), c) any non-reversible computer, d) any reversible computer, 



100 



11 GP Theory and its Applications 




Figure 11.1: Proportion of N AND trees that yield each three-input func- 
tions. As circuit size increases the distribution approaches a limit. 



e) CCNOT (Toffoli gate) computer, f) quantum computers, g) the “average” 
computer and h) AND, NAND, OR, NOR expressions. 

Recently, ( Langdon and PoIi| 2006| Poli and Langdon |2006b| ) started ex- 
tending these results to Turing complete machine code programs. For this 
purpose, a simple, but realistic, Turing complete machine code language, 
T7, was considered. It includes: directly accessed bit addressable memory, 
an addition operator, an unconditional jump, a conditional branch and four 
copy instructions. A mathematical analysis of the halting process based on 
a Markov chain model of program execution and halting was performed. 
The model can be used to estimate, for any given program length, impor- 
tant quantities, such as the halting probability and the run time of halting 
programs. This showed a scaling law indicating that the halting probabil- 
ity for programs of length L is of order 1 /v L , while the expected number 
of instructions executed by halting programs is of order \fL. In contrast 
to many proposed Markov models, this can be done very efficiently, mak- 
ing it possible to compute these quantities for programs of tens of million 
instructions in a few minutes. Experimental results confirmed the theory. 
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11.3 Bloat 



Starting in the early 1990s, researchers began to notice that in addition to 
progressively increasing their mean and best fitness, GP populations also 
showed certain other dynamics. In particular, it was noted that very often 
the average size (number of nodes) of the programs in a population, after a 
certain number of generations in which it was largely static, at some point 
would start growing at a rapid pace. Typically the increase in program size 
was not accompanied by any corresponding increase in fitness. The origin 
of this phenomenon, which is known as bloat , has effectively been a mystery 
for over a decade. 



Note that there are situations where one would expect to see program 
growth as part of the process of solving a problem. For example, GP runs 
typically start from populations of small random programs, and it may be 
necessary for the programs to grow in complexity for them to be able to 
comply with all the fitness cases (a situation which often arises in continuous 
symbolic regression problems). So, we should not equate growth with bloat 
and we should define bloat as program growth without (significant) return in 
terms of fitness. 



Bloat is not only surprising, it also has significant practical effects: large 
programs are computationally expensive to evolve and later use, can be hard 
to interpret, and may exhibit poor generalisation. For these reasons bloat 
has been a subject of intense study in GP. Over the years, many theories 
have been proposed to explain various aspects of bloat, and while great 
strides have been made, we still lack a single, universally-accepted unifying 
theory to explain the broad range of empirical observations. We review the 
key theoretical results on bloat in Section 



While discussions on the causes of bloat were going on, practitioners have 
still had to face the reality of combating bloat in their runs. Consequently, 
a variety of effective practical techniques have been proposed to counteract 
bloat. We review these in Section 11.3.2 where we will particularly focus on 



the parsimony pressure method (Koza 1992 Zhang and Miihlenbein 1993 



1995 Zhang et al. 1997), which is perhaps the simplest and most frequently 



used method to control bloat in genetic programming. 



11.3.1 Bloat in Theory 

As mentioned above, there are several theories of bloat. Let us start by 
looking at three of the oldest ones: the replication accuracy theory, the 
removal bias theory and the nature of program search spaces theory. 
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Three Classic Explanations for Bloat 



The replication accuracy theory (McPhee and Miller 1995) states that the 



success of a GP individual depends on its ability to have offspring that are 
functionally similar to the parent. As a consequence, GP evolves towards 
(bloated) representations that increase replication accuracy. 

The nodes in a GP tree can often be crudely categorised into two classes: 
active code and inactive code. Roughly speaking, inactive code is code 
that is not executed, or is executed but its output is then discarded. All 



remaining code is active code. The removal bias theory (Soule and Foster 



1998a) observes that inactive code in a GP tree tends to be low in the tree, 



residing, therefore, in smaller-than-average-size subtrees. Crossover events 
excising inactive subtrees produce offspring with the same fitness as their 
parents. On average the inserted subtree is bigger than the excised one, so 
such offspring are bigger than average while retaining the fitness of their 
parent, leading ultimately to growth in the average program size. 



Finally, the nature of program search spaces theory (Langdon and Poli 



1997 Langdon, Soule, Poli, and Foster 1999) predicts that above a certain 



size, the distribution of fitnesses does not vary with size. Since there are more 
long programs, the number of long programs of a given fitness is greater than 
the number of short programs of the same fitness. Over time GP samples 
longer and longer programs simply because there are more of them. 



Executable Models of Bloat 



The explanations for bloat provided by these three theories are largely qual- 
itative. There have, however, been some efforts to mathematically formalise 
and verify these theories. For example, |Banzhaf and Langd on (2002) defined 
an executable model of bloat where only the fitness, the size of active code and 
the size of inactive code were represented (i.e., there was no representation of 
program structures). Fitnesses of individuals were drawn from a bell-shaped 
distribution, while active and inactive code lengths were modified by a size- 
unbiased mutation operator. Various interesting effects were reported which 
are very similar to corresponding effects found in GP runs. Rosea (2003 ) pro- 
posed a similar, but slightly more sophisticated model which also included 
an analogue of crossover. This provided further interesting evidence. 



A strength of these executable models is their simplicity. A weakness is 
that they suppress or remove many details of the representation and opera- 
tors typically used in GP. This makes it difficult to verify if all the phenom- 
ena observed in the model have analogues in GP runs, and if all important 
behaviours of GP in relation to bloat are captured by the model. 
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Size Evolution Equation 



In (Poli 2001b; Poli and McPhee 2003b), a size evolution equation for ge- 
netic programming was developed, which provided an exact formalisation of 
the dynamics of average program size. The original equation was derived 
from the exact schema theory for GP, and expressed mean program size as 
a function of the size and selection probabilities of particular schemata rep- 



resenting program shapes. The equation has recently been simplified (Poli 



and McPhee 2008b) giving: 



E[p(t + 1)] = ^£ x p(£,t), 



( 11 . 1 ) 



where /i(t + 1) is the mean size of the programs in the population at gen- 
eration t + 1, E is the expectation operator, £ is a program size, and p(£, t ) 
is the probability of selecting programs of size £ from the population in 
generation t. 

This equation can be rewritten in terms of the expected change in average 
program size as: 

EW + 1) - p{t)\ = Y, £ x (p(*. t) - *)). (11-2) 

t 



where $>(£, t) is the proportion of programs of size £ in the current genera- 
tion. Both equations apply to a GP system with selection and any form of 
symmetric subtree cross over p~| 

Note that Equations ( 11.1 1 and ( 11.2 1 do not directly explain bloat. They 
are, however, important because they constrain what can and cannot hap- 
pen size-wise in GP populations. Any explanation for bloat (including the 
theories summarised above) has to agree with Equations ( 11.1 1 and (11.21. 

In particular, Equation ( 11.1 1 predicts that, for symmetric subtree- 
swapping crossover operators, the mean program size evolves as if selection 
only was acting on the population. This means that if there is a change in 
mean size (bloat, for example) it must be the result of some form of positive 
or negative selective pressure on some or all of the length classes £. Equa- 
tion ( |11.2 1 shows that there can be bloat only if the selection probability 
p(£, t) is different from the proportion >!>(£, t) for at least some £. In par- 
ticular, for bloat to happen there will have to be some small £'s for which 
p(£,t) < &(£,t) and also some bigger £’s for which p(£,t) > <&(£, t) (at least 
on average). 



hna symmetric operator the probability of selecting particular crossover points in the 
parents does not depend on the order in which the parents are drawn from the population. 
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Crossover Bias Theory of Bloat 



We conclude this review on theories of bloat with a recent explanation for 



bloat called the crossover bias theory (Dignum and Poli, 2007 Poli et al. 



2007). This is based on and is consistent with the size evolution equation 



(Equation 11.1). 



On average, each application of subtree crossover removes as much ge- 
netic material as it inserts; consequently crossover on its own does not pro- 
duce growth or shrinkage. While the mean program size is unaffected, how- 
ever, higher moments of the distribution are. In particular, crossover pushes 
the population towards a particular distribution of program sizes, known 
as a Lagrange distribution of the second kind, where small programs have a 
much higher frequency than longer ones. For example, crossover generates a 
very high proportion of single-node individuals. In virtually all problems of 
practical interest, however, very small programs have no chance of solving 
the problem. As a result, programs of above average size have a selective 
advantage over programs of below average size, and the mean program size 
increases. 

Because crossover will continue to create small programs, which will then 
be ignored by selection (in favour of the larger programs), the increase in 
average size will continue generation by generation. 



11.3.2 Bloat Control in Practice 



Numerous empirical techniques have been proposed to control bloat (Lang- 



don et al. 1999 Soule and Foster 1998b). We cannot look at them all. 



However, we briefly review some of the most important. 



Size and Depth Limits 



Rather naturally, the first and simplest method to control code growth is the 
use of hard limits on the size or depth of the offspring programs generated 
by the genetic operators. 



Many implementations of this idea (e.g., (Koza 1992)) apply a genetic 
operator and then check whether the offspring is beyond the size or depth 
limit. If it isn’t, the offspring enters the population. If, instead, the off- 
spring exceeds the limit, one of the parents is returned. Obviously, this 
implementation does not allow programs to grow too large. However, there 
is a serious problem with this way of applying size limits, or more generally, 
constraints to programs: parent programs that are more likely to violate a 
constraint will tend to be copied (unaltered) more often than programs that 
don’t. That is, the population will tend to be filled up with programs that 
nearly infringe the constraint, which is typically not what is desired. 



11.3 Bloat 



105 



It is well known, for example, that depth thresholds lead to the popu- 
lation filling up with very bushy programs where most branches reach the 
depth limit (being effectively full trees). On the contrary, size limits produce 
populations of stringy programs which tend to all approach the size limit. 
See (Crane and McPhee 2005| McPhee, Jarvis, and Crane 2004 1 for more 
on the impact of size and depth limits, and the differences between them. 

The problem can be fixed by not returning parents if the offspring violates 
a constraint. This can be realised with two different strategies. Firstly, one 
can just return the oversize offspring, but give it a fitness of 0, so that 
selection will get rid of it at the next generation. Secondly, one can simply 
declare the genetic operation failed, and try again. This can be done in two 
alternative ways: a) the same parent or parents are used again, but new 
mutation or crossover points are randomly chosen (which can be done up 
to a certain number of times before giving up on those parents), or b) new 
parents are selected and the genetic operation is attempted again. 

If a limit is used, programs must not be so tightly constrained that they 
cannot express any solution to the problem. As a rule of thumb, one should 
try to estimate the size of the minimum possible solution (using the terminals 
and functions given to GP) and add some percentage (e.g., 50-200%) as a 
safety margin. In general, however, it may be hard to heuristically come up 
with good limits, so some trial and error may be required. Alternatively, 
one can use one of the many techniques that have been proposed to adjust 
size limits during runs. These can be both at the level of individuals and the 



population. See for example the work by Silva and Almeida (2003); Silva 
and Costa (2004 2005a|b l; Silva, Silva, and Costa (20051. 



Anti-bloat Genetic Operators 

One can control bloat by using genetic operators which directly or indirectly 
have an anti-bloat effect. 

Among the most recent bloat-control methods are size fair crossover 



and size fair mutation (Crawford-Marks and Spector 2002 Langdon 2000). 



These work by constraining the choices made during the execution of a 
genetic operation so as to actively prevent growth. In size-fair crossover, for 
example, the crossover point in the first parent is selected randomly, as in 
standard crossover. Then the size of the subtree to be excised is calculated. 
This is used to constrain the choice of the second crossover point so as 
to guarantee that the subtree chosen from the second parent will not be 
“unfairly” big. 

Older methods include several mutation operators that may help control 
the average tree size in the population while still introducing new genetic 



material. Kinnear (1993) proposes a mutation operator which prevents the 



offspring’s depth being more than 15% larger than its parent. Langdon 



( 1998) proposes two mutation operators in which the new random subtree is 
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on average the same size as the code it replaces. In Hoist mutation (Kinnear 
1994a) the new subtree is selected from the subtree being removed from the 



parent, guaranteeing that the new program will be smaller than its parent. 



Shrink mutation ( Angeline 1996) is a special case of subtree mutation where 



the randomly chosen subtree is replaced by a randomly chosen terminal. 



McPhee and Poli ( 2002 ) provides theoretical analysis and empirical evidence 



that combinations of subtree crossover and subtree mutation operators can 
control bloat in linear GP systems. 

Other methods which control bloat by exploiting the bias of the operators 
were discussed in Section ITTTTl 



Anti-Bloat Selection 

As clarified by the size evolution equation discussed in the previous section, 
in systems with symmetric operators, bloat can only happen if there are 
some longer-than-average programs that are fitter than average or some 
shorter-than-average programs that are less fit than average, or both. So, 
it stands to reason that in order to control bloat one needs to somehow 
modulate the selection probabilities of programs based on their size. 

As we have discussed in Section |9.2.1[ recent methods also include the 
use of multi-objective optimisation to control bloat. This typically involves 
the use of a modified selection based on the Pareto criterion. 

A recent technique, the Tarpeian method (Poli 2003), controls bloat 



by acting directly on the selection probabilities in Equation (11.2). This is 



done by setting the fitness of randomly chosen longer-than-average programs 
to 0. This prevents them being parents. By changing how frequently this 
is done the anti-bloat intensity of Tarpeian control can be modulated. An 
advantage of the method is that the programs whose fitness is zeroed are 
never executed, thereby speeding up runs. 



The well-known parsimony pressure method (Koza 1992 Zhang and 



Miihlenbein 1993 1995 Zhang et al. , 1997 1 changes the selection probabili- 



ties by subtracting a value based on the size of each program from its fitness. 
Bigger programs have more subtracted and, so, have lower fitness and tend 
to have fewer children. That is, the new fitness function is /( x) — c x £(x), 
where £(x) is the size of program x , f(x) is its original fitness and c is a con- 
stant known as the parsimony coefficient p| |Zhang and Miihlenbein (19951 
showed some benefits of adaptively adjusting the coefficient c at each gen- 
eration but most implementations actually keep the parsimony coefficient 
constant. 



2 While the new fitness is used to guide evolution, one still needs to use the original 
fitness function to recognise solutions and stop runs. 
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The parsimony pressure method can be seen as a way to address the 



generalisation-accuracy tradeoff common in machine learning (Rosea and 



Ballard 


1996b 


Zhang and Miihlenbein 


1995 



between this method and the Minimum Description Length (MDL) principle 
used to control bloat in (Iba |1997 Iba et al. 1994 Iba, de Garis, and 



Sato 1995a). The MDL approach uses a fitness function which combines 



program complexity (expressed as the number of bits necessary to encode 
the program’s tree) and classification error (expressed as the number of 
bits necessary to encode the errors on all fitness cases). Rosea also linked 
the parsimony pressure method to his approximate evolution equations for 



rooted-tree schemata (Rosea 1996 1997 Rosea and Ballard 1996b 1999). 



Controlling bloat while at the same time maximising fitness turns the 
evolution of programs into either a multi-objective optimisation problem or, 
at least, into a constrained optimisation problem. The parsimony pressure 
method effectively treats the minimisation of size as a soft constraint and 
attempts to enforce this constraint using the penalty method, i.e. , by decreas- 
ing the fitness of programs by an amount that depends on their size. The 
penalty is typically simply proportional to program size. The intensity with 
which bloat is controlled is, therefore, determined by the parsimony coeffi- 
cient. The value of this coefficient is very important: too small a value and 
runs will still bloat wildly; too large a value and GP will take the minimisa- 
tion of size as its main target and will almost ignore fitness, thus converging 
towards extremely small but useless programs ( Soule 1998). However, good 
values of the parsimony coefficient are highly dependent on particulars such 
as the problem being solved, the choice of functions and terminals, and vari- 
ous parameter settings. Furthermore, with a constant parsimony coefficient 
the method can only achieve partial control over the dynamics of the average 
program size over time. 

Recently, a theoretically sound method for setting the parsimony coeffi- 



cient in a principled manner has been proposed (Poli and McPhee 2008b). 



The covariant parsimony pressure method is based on an analysis of the size 
evolution Equation ( 11.1 ), and is easy to implement. It recalculates the par- 
simony coefficient c at each generation using c = Cov(£, /)/ Var(£), where 
Cov(£, /) is the covariance between program size £ and program fitness / 
in the population, and Var(£) is the variance of program sizes. Note that c 
needs to be recalculated each generation because both Cov(£, /) and Var(£) 
change from generation to generation. As shown in Figure 11. 2| (in the por- 
tion labelled “Local”), using this equation ensures that the mean program 
size remains at the value set by the initialisation procedure (although there 
can be a small amount of drift). There is a variant of the method that allows 
the user to even decide what function the mean program size should follow 
over time. As shown in the figure this provides complete control over the 
population size dynamics. 
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Avg size vs. time, different target size functions 




Figure 11.2: Plots of the evolution average size over 500 generations for 
multiple runs of the 6-MUX problem with various forms of covariant parsi- 
mony pressure. The “Constant” runs had a constant target size of 150. In 
the “Sin” runs the target size was sin ((generation + l)/50.0) x 50.0 + 150. 
For the “Linear” runs the target size was 150 + generation. The “Limited” 
runs used no size control until the size reached 250, then the target was held 
at 250. Finally, the “Local” runs used c= Cov(£, /)/ Var(£), which allowed 
a certain amount of drift but still avoided runaway bloat (see text). 



Part III 



Practical Genetic 
Programming 




Three little pigs provide a demonstration of construction techniques. . . 



and Goldilocks finally gets it just right. 
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Chapter 12 

Applications 



Since its early beginnings, GP has produced a cornucopia of results. The 
literature, which covers more than 5000 recorded uses of GP, reports an 
enormous number of applications where GP has been successfully used as 
an automatic programming tool, a machine learning tool or an automatic 
problem-solving engine. It is impossible to list all such applications here. 
In the following sections we start with a discussion of the general kinds 
of problems where GP has proved successful (Section 12.1 1 and then re- 
view a representative subset for each of the main application areas of GP 
(Sections |12.2 - 12.11 1, devoting particular attention to the important areas 
of symbolic regression (Section |12.2 I and human-competitive results (Sec- 
tion 12.3). 



12.1 Where GP has Done Well 

Based on the experience of numerous researchers over many years, it appears 
that GP and other evolutionary computation methods have been especially 
productive in areas having some or all of the following properties: 

The interrelationships among the relevant variables is unknown 
or poorly understood (or where it is suspected that the cur- 
rent understanding may possibly be wrong). One of the partic- 
ular values of GP (and other evolutionary algorithms) is in exploring 
poorly understood domains. If the problem domain is well understood, 
there may well be analytical tools that will provide quality solutions 
without the uncertainty inherent in a stochastic search process such 
as GP. GP, on the other hand, has proved successful where the appli- 
cation is new or otherwise not well understood. It can help discover 
which variables and operations are important; provide novel solutions 
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to individual problems; unveil unexpected relationships among vari- 
ables; and, sometimes GP can discover new concepts that can then be 
applied in a wide variety of circumstances. 

Finding the size and shape of the ultimate solution is a major 
part of the problem. If the form of the solution is known, then 
alternative search mechanisms that work on fixed size representations 
(e.g., genetic algorithms) may be more efficient because they won’t 
have to discover the size and shape of the solution. 

Significant amounts of test data are available in computer- 
readable form. GP (and most other machine learning and search 
techniques) benefit from having significant pools of test data. At a 
minimum there needs to be enough data to allow the system to learn 
the salient features, while leaving enough at the end to use for valida- 
tion and over-fitting tests. It is also useful if the test data are as clean 
and accurate as possible. GP is capable of dealing gracefully with 
certain amounts of noise in the data (especially if steps are taken to 
reduce over-fitting) , but cleaner data make the learning process easier 
for any system, GP included. 

There are good simulators to test the performance of tentative 
solutions to a problem, but poor methods to directly obtain 
good solutions. In many domains of science and engineering, sim- 
ulators and analysis tools have been constructed that allow one to 
evaluate the behaviour and performance of complex artifacts such as 
aircraft, antennas, electronic circuits, control systems, optical systems, 
games, etc. These simulators contain enormous amounts of knowledge 
of the domain and have often required several years to create. These 
tools solve the so-called direct problem of working out the behaviour 
of a solution or tentative solution to a problem, given the solution it- 
self. However, the knowledge stored in such systems cannot be easily 
used to solve the inverse problem of designing an artifact from a set 
of functional or performance requirements. A great advantage of GP 
is that it is able to connect to simulators and analysis tools and to 
“data-mine” the simulator to solve the inverse problem automatically. 
That is, the user need not specify (or know) much about the form of 
the eventual solution before starting. 

Conventional mathematical analysis does not, or cannot, provide 
analytic solutions. If there is a good exact analytic solution, one 
probably wants to use it rather than spend the energy to evolve what 
is likely to be an approximate solution. That said, GP might still be 
a valuable option if the analytic solutions have undesirable properties 
(e.g., unacceptable run times for large instances), or are based on 
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assumptions that don’t apply in one’s circumstances (e.g., noise-free 
data). 

An approximate solution is acceptable (or is the only result that 
is ever likely to be obtained). Evolution in general, and GP in 
particular, is typically about being “good enough” rather than “the 
best”. (A rabbit doesn’t have to be the fastest animal in the world: 
it just has to be fast enough to escape that particular fox.) As a 
result, evolutionary algorithms tend to work best in domains where 
close approximations are both possible and acceptable. 



Small improvements in performance are routinely measured (or 
easily measurable) and highly prized. Technological efforts tend 
to concentrate in areas of high economic importance. In these domains, 
the state of the art tends to be fairly advanced, and, so, it is difficult 
to improve over existing solutions. However, in these same domains 
small improvements can be extremely valuable. GP can sometimes 
discover small, but valuable, relationships. 



Two (of many) examples of successful applications of GP that satisfy 
many of these properties are the work of Lohn, Hornby, and Linden (2004) on 



satellite antenna design and Spector’s evolution of new quantum computing 



algorithms that out-performed all previous approaches (Spector, Barnum, 


and Bernstein 1998 


Spector, Barnum, Bernstein, and Swamy 


1999). Both 



of these domains are complex, without analytic solutions, yet in both cases 
good simulators existed which could be used to evaluate the fitness of so- 
lutions. In other words, people didn’t know how to solve the problems but 
they could (automatically) recognise a good solution when they saw one. 
Both of these applications resulted in the discovery of highly successful and 
unexpected designs. The key component of the evolved quantum algorithm 
could in fact be extracted and applied in a wide variety of other settings, 
leading to major improvements in a number of related quantum algorithms 
as well as the ones under specific study. 



12.2 Curve Fitting, Data Modelling and 
Symbolic Regression 

In principle, there are as many possible applications of GP as there are ap- 
plications for programs — in other words, virtually infinite. However, before 
one can try to solve a new problem with GP, one needs to define an appropri- 
ate fitness function. In problems where only the side effects of a program are 
of interest, the fitness function usually compares the effects of the execution 
of a program in some suitable environments with a desired behaviour, often 
in a very application-dependent manner. However, in many problems the 
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goal is to find a function whose output has some desired property, e.g., the 



function matches some target values (as in the example given in Section 4.1 1 . 
This is generally known as a symbolic regression problem. 

Many people are familiar with the notion of regression. Regression means 
finding the coefficients of a predefined function such that the function best 
fits some data. A problem with regression analysis is that, if the fit is not 
good, the experimenter has to keep trying different functions by hand until 
a good model for the data is found. Not only is this laborious, but also 
the results of the analysis depend very much on the skills and inventiveness 
of the experimenter. Furthermore, even expert users tend to have strong 
mental biases when choosing functions to fit. For example, in many applica- 
tion areas there is a considerable tradition of using only linear or quadratic 
models, even when the data might be better fit by a more complex model. 

Symbolic regression attempts to go beyond this. It consists of finding 
a function that fits the given data points without making any assumptions 
about the structure of that function. Since GP makes no such assumption, 
it is well suited to this sort of discovery task. Symbolic regression was one 
of the earliest applications of GP (Koza 1992), and continues to be widely 



studied (Cai, Pacheco- Vega, Sen, and Yang 



2006 



Gustafson, Burke, and 



Krasnogor 


to 

o 

0 

01 


Keijzer 2004[ Lew, Spencer, Scarpa, Worden, Rutherford, 


and Hemez 


2006 


). 



The steps necessary to solve symbolic regression problems include the five 
preparatory steps mentioned in Chapter [2] We practiced them in the exam- 
ple in Chapter [4j which was an instance of a symbolic regression problem. 
There is an important difference here, however: the data points provided in 
Chapter [4] were computed using a simple formula, while in most realistic sit- 
uations each point represents the measured values taken by some variables 
at a certain time in some dynamic process, in a repetition of an experiment, 
and so on. So, the collection of an appropriate set of data points for symbolic 
regression is an important and sometimes complex task. 



For instance, consider the case of using GP to evolve a soft sensor (Jor- 



daan, Kordon, Chiang, and Smits 2004). The intent is to evolve a function 



that will provide a reasonable estimate of what a sensor (in an industrial 
production facility) would report, based on data from other actual sensors 
in the system. This is typically done in cases where placing an actual sensor 
in that location would be difficult or expensive. However, it is necessary to 
place at least one instance of such a sensor in a working system in order to 
collect the data needed to train and test the GP system. Once the sensor 
is placed, one would collect the values reported by that sensor and by all 
the other real sensors that are available to the evolved function, at various 
times, covering the various conditions under which the evolved system will 
be expected to act. 

Such experimental data typically come in large tables where numerous 
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quantities are reported. Usually we know which variable we want to predict 
(e.g., the soft sensor value), and which other quantities we can use to make 
the prediction (e.g., the hard sensor values). If this is not known, then 
experimenters must decide which are going to be their dependent variables 
before applying GP. Sometimes, in practical situations, the data tables 
include hundreds or even thousands of variables. It is well known that 
in these cases the efficiency and effectiveness of any machine learning or 
program induction method, including GP, can dramatically drop as most of 
the variables are typically redundant or irrelevant. This forces the system 
to waste considerable energy on isolating the key features. To avoid this, 
it is necessary to perform some form of feature selection , i.e., we need to 
decide which independent variables to keep and which to leave out. There 
are many techniques to do this, but these are beyond the scope of this book. 
However, it is worth noting that GP itself can be used to do feature selection 
as shown in (Langdon and Buxton 2004). 

There are problems where more than one output (prediction) is required. 
For example, Table |12.1| shows a data set with four variables controlled 
during data collection (left) and six dependent variables (right). The data 
were collected for the purpose of solving an inverse kinematics problem in the 



Elvis robot (Langdon and Nordin 2001). The robot is shown in Figure 12.1 



during the acquisition of a data sample. The roles of the independent and 
dependent variables are swapped when GP is given the task of controlling 
the arm given data from the robot’s eyes. 

There are several GP techniques which might be used to deal with ap- 
plications where multiple outputs are required: GP individuals including 

page 



multiple trees (as in Figure 2.2 



11), linear GP with multiple output 



registers (see Section 7.1), graph-based GP with multiple output nodes (see 
Section 7.2 1, a single GP tree with primitives operating on vectors, and so 
forth. 



Once a suitable data set is available, its independent variables must all 
be represented in the primitive set. What other terminals and functions are 
included depends very much on the type of the data being processed (are 
they numeric? are they strings? etc.) and is often guided by the information 
available to the experimenter and the process that generated the data. If 
something is known (or strongly suspected) about the desired structure of 
the function to be evolved, it may be very beneficial to use this information 
(or to apply some constraints, like those discussed in Section 6.2 1 . For 
example, if the data are known to be periodic, then the function set should 
probably include something like the sine function. 



What is common to virtually all symbolic regression problems is that 
the fitness function must measure how close the outputs produced by each 
program are to the values of the dependent variables, when the correspond- 
ing values of the independent ones are used as inputs for the program. So, 
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Table 12.1: Samples showing the size and location of Elvis’s finger tip 
as apparent to this two eyes, given various right arm actuator set points (4 
degrees of freedom). Cf. Figure 12.1 When the data are used for training, 



GP is asked to invert the mapping and evolve functions from data collected 
by both cameras showing a target location to instructions to give to Elvis’s 
four arm motors so that its arm moves to the target. 



Arm actuator 


X 


Left 

y 


eye 

size 


Right eye 
x y size 


-376 


-626 


1000 


-360 


44 


10 


29 


-9 


12 


25 


-372 


-622 


1000 


-380 


43 


7 


29 


-9 


12 


29 


-377 


-627 


899 


-359 


43 


9 


33 


-20 


14 


26 


-385 


-635 


799 


-319 


38 


16 


27 


-17 


22 


30 


-393 


-643 


699 


-279 


36 


24 


26 


-21 


25 


20 


-401 


-651 


599 


-239 


32 


32 


25 


-26 


28 


18 


-409 


-659 


500 


-200 


32 


35 


24 


-27 


31 


19 


-417 


-667 


399 


-159 


31 


41 


17 


-28 


36 


13 


-425 


-675 


299 


-119 


30 


45 


25 


-27 


39 


8 


-433 


-683 


199 


-79 


31 


47 


20 


-27 


43 


9 


-441 


-691 


99 


-39 


31 


49 


16 


-26 


45 


13 






continues for 


a total of 691 lines 







most symbolic regression fitness functions tend to include summing the er- 



rors measured for each record in the data set, as we did in Section 4.2.2 
Usually either the absolute difference or the square of the error is used. 



The fourth preparatory step typically involves choosing a size for the 
population (which is often done initially based on the perceived difficulty of 
the problem, and is then refined based on the actual results of preliminary 
runs). The user also needs to set the balance between the selection strength 
(normally tuned via the tournament size) and the intensity of variation 
(which can be varied by modifying the mutation and crossover rates, but 
many researchers tend to fix to some standard values) . 
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Figure 12.1: Elvis sitting with its right hand outstretched. The apparent 
position and size of a bright red laser attached to its finger tip is recorded 
(see Table 12.1 ). The data are then used to train a GP to move the robot’s 
arm to a spot in three dimensions using only its eyes. 



12.3 Human Competitive Results: 
The Humies 



Getting machines to produce human-like results is the very reason for the 
existence of the fields of artificial intelligence and machine learning. How- 
ever, it has always been very difficult to assess how much progress these 
fields have made towards their ultimate goal. Alan Turing understood that 
in order to avoid human biases when assessing machine intelligence, machine 
behaviour must be evaluated objectively. This led him to propose an imi- 



tation game, now known as the Turing test (Turing 1950). Unfortunately, 



the Turing test is not usable in practice, and so, there is a need for more 
workable objective tests of machine intelligence. 



Koza, Bennett, and Stiffelman (1999) suggested shifting attention from 



the notion of intelligence to the notion of human competitiveness. A result 
cannot acquire the rating of “human competitive” merely because it is en- 
dorsed by researchers inside the specialised fields that are attempting to 
create machine intelligence. A result produced by an automated method 
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must earn the rating of “human competitive” independently of the fact that 
it was generated by an automated method. 

Koza proposed that an automatically-created result should be considered 
“human-competitive” if it satisfies at least one of these eight criteria: 



1. The result was patented as an invention in the past, is an improvement 
over a patented invention or would qualify today as a patentable new 
invention. 

2. The result is equal to or better than a result that was accepted as a new 
scientific result at the time when it was published in a peer-reviewed 
scientific journal. 

3. The result is equal to or better than a result that was placed into a 
database or archive of results maintained by an internationally recog- 
nised panel of scientific experts. 

4. The result is publishable in its own right as a new scientific result, 
independent of the fact that the result was mechanically created. 

5. The result is equal to or better than the most recent human-created 
solution to a long-standing problem for which there has been a succes- 
sion of increasingly better human-created solutions. 

6. The result is equal to or better than a result that was considered an 
achievement in its field at the time it was first discovered. 

7. The result solves a problem of indisputable difficulty in its field. 

8. The result holds its own or wins a regulated competition involving 
human contestants (in the form of either live human players or human- 
written computer programs). 



These criteria are independent of, and at arm’s length from, the fields of 
artificial intelligence, machine learning, and GP. 

Over the years, dozens of results have passed the human-competitiveness 
test. Some pre-2004 human-competitive results include: 



• Creation of quantum algorithms, including a better-than-classical al- 
gorithm for a database search problem and a solution to an AND / OR 



query problem (Spector et al. 1998 1999). 



Creation of a competitive soccer-playing program for the RoboCup 1997 



competition (Luke 1998). 



• Creation of algorithms for the transmembrane segment identification 
problem for proteins (Koza 1994[ Sections 18.8 and 18.10) and (Koza 
et al. 1999 Sections 16.5 and 17.2). 
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• Creation of a sorting network for seven items using only 16 steps (Koza 
Sections 21.4.4, 23.6, and 57.8.1). 



et al. 1999 



• Synthesis of analogue circuits (with placement and routing, in some 

[1999] 



cases), including: 60- and 96-decibel amplifiers (Koza et al. 



Section 45.3); circuits for squaring, cubing, square root, cube root, 



logarithm, and Gaussian functions ( 


Koza et al. 1999 Section 47.5.3); 


a circuit for time-optimal control of a robot ( 


Koza et al. 


1999 Section 


48.3); an electronic thermometer (JKoza et al. 


1999 


Section 49.3); a 


voltage-current conversion circuit (Koza, Keane, Streeter, Mydlowec, 



Yu, and Lanza 2003 Section 15.4.4). 



Creation of a cellular automaton rule for the majority classification 



problem that is better than all known rules written by humans (Andre 



et al. 1996). 



• Synthesis of topology for controllers, including: a PID (proportional, 
integrative, and derivative) controller (Koza et al. 2003) Section 9.2) 
and a PID-D2 (proportional, integrative, derivative, and second deriva- 
tive) controller ( Koza et ak| |2003[ Section 3.7); PID tuning rules that 
outperform the Ziegler-Nichols and Astrom-Hagglund tuning rules 
Chapter 12); three non-PID controllers that out- 



(Koza et al. , 2003 



perform a PID controller that uses the Ziegler-Nichols or Astrom- 



Hagglund tuning rules (Koza et al. 2003 Chapter 13). 



In total (Koza and Poli 2005) lists 36 human-competitive results. These 
include 23 cases where GP has duplicated the functionality of a previously 
patented invention, infringed a previously patented invention, or created a 
patentable new invention. Specifically, there are fifteen examples where GP 
has created an entity that either infringes or duplicates the functionality of 
a previously patented 20 tft, -century invention, six instances where GP has 
done the same with respect to an invention patented after 1 January 2000, 
and two cases where GP has created a patentable new invention. The two 
new inventions are general-purpose controllers that outperform controllers 
employing tuning rules that have been in widespread use in industry for 
most of the 20 th century. 

Many of the pre-2004 results were obtained by Koza. However, since 
2004, a competition has been held annually at ACM’s Genetic and Evolu- 
tionary Computation Conference (termed the Human-Competitive awards 
- the Humies). The $10,000 prize is awarded to projects that have pro- 
duced automatically-created results which equal or better those produced 
by humans. 

The Humies Prizes have typically been awarded to applications of evo- 
lutionary computation to high-tech fields. Many used GP. For example, 
the 2004 gold medals were given for the design, via GP, of an antenna for 
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Figure 12.2: Award winning human-competitive antenna design produced 
by GP. 



deployment on NASA’s Space Technology 5 Mission (see Figure [T2.2[ ) (|Lohn 
et al. 20041 and for evolutionary quantum computer programming (Spec- 



tor 2004 1 . There were three silver medals in 2004: one for the evolution of 



local search heuristics for SAT using GP (Fukunaga 20041, one for the ap- 



plication of GP to the synthesis of complex kinematic mechanisms (Lipson 



2004 ) and one for organisation design optimisation using GP ( KHosraviani 



2003 



KHosraviani, Levitt, and Koza 2004). Also, four of the 2005 medals 



were awarded for GP applications: the invention of optical lens systems (Al- 



Sakran, Koza, and Jones 2005 Koza, Al-Sakran, and Jones 2005 ), the evo- 



lution of a quantum Fourier transform algorithm (Massey, Clark, and Step- 



ney 2005 1 , evolving assembly programs for Core War ( Corno, Sanchez, and 



Squillero 20051 and various high-performance game players for Backgam- 



mon, Robocode and Chess endgame ( 


Azaria and Sipper 


2005a|b 


Haupt- 


man and Sipper 


2005 


Shichel, Ziserman, and Sipper 2005). In 2006, GP 



again scored a gold medal with the synthesis of interest point detectors for 
image analysis (Trujillo and Olague 2006a|b I, while it scored a silver medal 
in 2007 with the evolution of an efficient search algorithm for the Mate-in-N 
problem in Chess (Hauptman and Sipper 2007|) (see Figure 12.31. 



Note that many human competitive results were presented at the Humies 
2004-2007 competitions (e.g., 11 of the 2004 entries were judged to be human 
competitive). However, only the very best were awarded medals. So, at the 
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Figure 12.3: Example mate-in-2 problem. 



time of writing we estimate that there are at least 60 human competitive 
results obtained by GP. This shows GP’s potential as a powerful invention 
machine. 



12.4 Image and Signal Processing 



Hampo and Markoj (jT992j) were among the first people from industry to 
consider using GP for signal processing. They evolved algorithms for pre- 
processing electronic motor vehicle signals for possible use in engine moni- 
toring and control. 

Several applications of GP for image processing have been for military 
uses. For example, Tackett (1993) evolved algorithms to find tanks in in- 



frared images. Howard, Roberts, and Brankin (1999); Howard, Roberts, and 



Ryan (2006) evolved programs to pick out ships from SAR radar mounted 



on satellites in space and to locate ground vehicles from airborne photo re- 
connaissance. They also used GP to process surveillance data for civilian 
purposes, such as predicting motorway traffic jams from subsurface traffic 



speed measurements (Howard and Roberts 2004). 



Using satellite SAR radar, Daida, Homines, Bersano-Begey, Ross, and 



Vesecky (1996) evolved algorithms to find features in polar sea ice. Opti- 



cal satellite images can also be used for environmental studies (Chami and 



Robilliard 20021 and for prospecting for valuable minerals (Ross, Gualtieri, 



Fueten, and Budkewitsch, 20051. 



Alcazar used GP to find recurrent filters (including artificial neural net- 
works ( Esparcia- Alcazar and Sharman 1996j) for one-dimensional electronic 
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signals (Sharman and Esparcia- Alcazar 1993). Local search (simulated an- 
nealing or gradient descent) can be used to adjust or fine-tune “constant” 



values within the structure created by genetic search (Smart and Zhang 



2004). 



Yu and Bhanu ( 


2006 


have used GP to preprocess images, particularly 


of human faces, to find regions of interest for subsequent analysis. See also 


(Trujillo and Olague 


2006a). 



Zhang has been particularly active at evolving programs with GP to 



visually classify objects (typically coins) ( 


Zhang and Smart 


2006) 


also applied GP to human speech ( 


Xie, Zhang, and Andreae 


2006) 



“Parisian GP” is a system in which the image processing task is split 
across a swarm of evolving agents (“flies”). In (Louchet 2001 Louchet, 



Guyon, Lesot, and Boumaza 2002) the flies reconstruct three dimensions 



from pairs of stereo images. For example, in (Louchet 2001 J, as the flies 
buzz around in three dimensions their position is projected onto the left and 
right of a pair of stereo images. The fitness function tries to minimise the 
discrepancy between the two images, thus encouraging the flies to settle on 
visible surfaces in the 3-D space. So, the true 3-D space is inferred from 
pairs of 2-D images taken from slightly different positions. 

While the likes of Google have effectively indexed the written word, for 
speech and pictures indexing has been much less effective. One area where 
GP might be applied is in the automatic indexing of images. Some initial 



steps in this direction are given in 


Theiler, Harvey, Brumby, Szymanski, 


Alferink, Perkins, Porter, and Bloch 


1999). 



To some extent, extracting text from images (OCR) can be done fairly 
reliably, and the accuracy rate on well formed letters and digits is close 



2005) such as Arabic ( . 


-Classen and Heywood 2002) and oriental languages, 


handwriting (De Stefano, Cioppa, and Marcelli 


2002 j Gagne and Parizeau 


2006| Krawiec 2004 Teredesai and Govindaraju 


2005) (such as the MNIST 


examples), other texts 


(Rivero, nal, Dorado, and Pazos, 2004) and musical 


scores (Quintana, Poli, and Claridge 2006). 



The scope for applications of GP to image and signal processing is almost 

1996b). GP image 



unbounded. A promising area is medical imaging (Poli 



techniques can also be used with sonar signals ( Martin 2006 ) . Off-line work 
on images includes security and verification. For example, |Usman, Khan,| 
Chamlawi, and Majid (2007) have used GP to detect image watermarks 



which have been tampered with. Recent work by Zhang has incorporated 



multi-objective fitness into GP image processing (Zhang and Rockett, 2006). 

In 1999 Poli, Cagnoni and others founded the annual European Work- 
shop on Evolutionary Computation in Image Analysis and Signal Processing 
(EvoIASP). EvoIASP is held every year with the EuroGP. Whilst not solely 
dedicated to GP, many GP applications have been presented at EvoIASP. 
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12.5 Financial Trading, Time Series 

Prediction and Economic Modelling 

GP is very widely used in the areas of financial trading, time series prediction 
and economic modelling and it is impossible to describe all its applications. 
It this section we will hint at just a few areas. 

Chen has written more than 60 papers on using GP in finance and eco- 
nomics. Recent papers have looked at the modelling of agents in stock 



markets (Chen and Liao 20051, game theory (Chen, Duffy, and Yeh 20021, 



evolving trading rules for the S&P 500 (Yu and Chen 20041 and forecasting 



the Heng-Sheng index (Chen, Wang, and Zhang 1999). 



The efficient markets hypothesis is a tenet of economics. It is founded 
on the idea that everyone in a market has “perfect information” and acts 
“rationally”. If the efficient markets hypothesis held, then everyone would 
see the same value for items in the market and so agree the same price. 
Without price differentials, there would be no money to be made from the 
market itself. Whether it is trading potatoes in northern France or dollars 
for yen, it is clear that traders are not all equal and considerable doubt has 
been cast on the efficient markets hypothesis. So, people continue to play the 
stock market. Game theory has been a standard tool used by economists to 
try to understand markets but is increasingly supplemented by simulations 
with both human and computerised agents. GP is increasingly being used 
as part of these simulations of social systems. 



Neely, Weller, and Dittmar (19971, Neely and Weller (1999 20011 and 



Neely (2003) of the US Federal Reserve Bank used GP to study intra-day 
technical trading on the foreign exchange markets to suggest the market is 
“efficient” and found no evidence of excess returns. This negative result 



was criticised by 


Marney, Miller, Fyfe, and Tarbert 


(2001). Later work by 


Neely, Weller, and Ulrich 


(2006) suggested that data after 1995 are consis- 



tent with Lo’s adaptive markets hypothesis rather than the efficient markets 
hypothesis. Note that here GP and computer tools are being used in a 
novel data-driven approach to try and resolve issues which were previously 
a matter of dogma. 

From a more pragmatic viewpoint, Kaboudan shows GP can forecast in- 
ternational currency exchange rates (|Kaboudan 20051, stocks (Kaboudan 



2000) and stock returns (Kaboudan 11999). Tsang and his co-workers con- 



tinue to apply GP to a variety of financial arenas, including: betting (Tsang, 



Li, and Butler 



and Li 2002 



19981, forecasting stock prices (Li and Tsang 1999 Tsang 



Tsang, Yung, and Li 2004), studying markets (Martinez- 



Jaramillo and Tsang 2007), approximating Nash equilibrium in game the- 



ory (Jin 2005 Jin and Tsang 2006 Tsang and Jin 20061 and arbitrage 



(Tsang, Markose, and Er 2005). Dempster and HSBC also use GP in for- 



eign exchange trading (Austin, Bates, Dempster, Leemans, and Williams 
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2004 


Dempster and Jones[|2000; Dempster, Payne, Romahi, and Thompson, 


2001 


). Pillay has used GP in social studies and teaching aids in education, 


e.g. 


Pillay 2003). As well as trees (Koza 


1990), other types of GP have 


been used in finance, e.g. (Nikolaev and Iba 


to 

o 

o 

to 



Since 1995 the International Conference on Computing in Economics 
and Finance (CEF) has been held every year. It regularly attracts GP pa- 
pers, many of which are on-line. In 2007 Brabazon and O’Neill established 
the European Workshop on Evolutionary Computation in Finance and Eco- 
nomics (EvoFIN). EvoFIN is held with EuroGP. 



12.6 Industrial Process Control 



There is evidence that GP is frequently used in industrial process control, 
although, of course, most industrialists have little time to spend on aca- 
demic reporting. A notable exception is Dow Chemical, where a group has 



been very active in publishing results (Castillo, Kordon, and Smits 2006a 



Castillo, Kordon, Smits, Christenson, and Dickerson 2006bf Jordaan, den| 


Doelder, and Smits) 


2006 Kordon, Castillo, Smits, and Kotanchek 


2005 


Kotanchek et al. 


2006 


Mercure, Smits, and Kordon 


20011. 


Kordon 


(2006) 



describes where industrial GP stands now and how it will progress. 

Another active collaboration is that of Kovacic and Balic, who used GP 
in the computer numerical control of industrial milling and cutting machin- 



ery 


Kovacic and Balic 2003 1 . The partnership of Deschaine and Francone 


(Francone and Deschaine 20041 is most famous for their use of Discipu- 


lus ( 


Foster 


2001 


1 for detecting bomb fragments and unexploded ordnance 
). Discipulus has also been used as an aid in the develop- 


( Deschaine 


2006 



ment of control systems for rubbish incinerators (Deschaine, Patel, Guthrie, 



Grimski, and Ades 2001). 

One of the earliest users of GP in control was Willis’ Chemical Engi- 
neering group at Newcastle, which used GP to model flow in a plasticating 
extruder (Willis, Hiden, and Montague 1997a). Other GP applications in 



Willis, Searson, and Montague (2000) also modelled extruding food. Sear- 



the plastics industry include (Brezocnik, Balic, and Gusel 20001. McKay, 



son, Montague, and Willis (1998) modelled control of chemical reactions in 



continuous stirred tank reactors. Marenbach (1998) investigated GP in the 



control of biotech reactors. |Willis, Hi den, Marenbach, McKay, and Mon- 
tague (1997b) surveyed GP applications, including in the area of control. 



Lewin, Lachman-Shalem, and Grosman (2006) and Dassau, Grosman, 
and Lewin (|2006 1 applied GP to the control of an integrated circuit fabrica- 



tion plant. Domingos worked on simulations of nuclear reactors (PWRs to 



be exact) to devise better ways of preventing xenon oscillations (Domingos, 
2005|). GP has also been used to identify the state 



Schirru, and Martinez 



of a plant to be controlled (in order to decide which of various alternative 
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control laws to apply). For example, Fleming’s group in Sheffield used multi- 



objective GP ( 


Hinchliffe and Willis 2003 


Rodriguez-Vazquez, Fonseca, and 


Fleming) 2004 


) to reduce the cost of running aircraft jet engines (Arkov, 



Evans, Fleming, Hill, Norton, Pratt, Rees, and Rodriguez- Vazquez} 2000 



Evans, Fleming, Hill, Norton, Pratt, Rees, and Rodriguez- Vazquez 2001). 



Alves da Silva and Abrao (2002) surveyed GP and other AI techniques 



applied in the electrical power industry. 



12.7 Medicine, Biology and Bioinformatics 



GP has long been applied to medicine, biology and bioinformatics. Early 



work by Handley ( 1993 1 and Koza and Andre ( 1996 1 used GP to make pre- 



dictions about the behaviour and properties of biological systems, principally 
proteins. Oakley, a practising medical doctor, used GP to model blood flow 



in toes (Oakley 19941 as part of his long term interests in frostbite. 

In 2002 Banzhaf and Foster organised BioGEC: the first GECCO work- 
shop on biological applications of genetic and evolutionary computation. 
BioGEC has become a bi-annual feature of the annual GECCO conference. 
Half a year later Marchiori and Corne organised EvoBio: the European con- 
ference on evolutionary computation, machine learning and data mining in 
bioinformatics. EvoBio is held every year alongside EuroGP. GP figures 
heavily in both BioGEC and EvoBIO. 

GP is often used in biomedical data mining. Of particular medical in- 



terest are very wide data sets, with many inputs per sample (Lavington, 



Dewhurst, Wilkins, and Freitas 1999). Examples include infrared spectra 



(Ellis, Broadhurst, and Goodacre 2004 Ellis, Broadhurst, Kell, Rowland, 



and Goodacre, 2002( Goodacre 2003| 


Goodacre, Shann, Gilbert, Timmins, 


McGovern, Alsberg, Kell, and Logan, 2000) Harrigan, LaPlante, Cosma, 


Cockerell, Goodacre, Maddox, Luyendyk, Ganey, and Roth 2004| 


John- 



son, Gilbert, Winson, Goodacre, Smith, Rowland, Hall, and Kell) |2000[ 
McGovern, Broadhurst, Taylor, Kaderbhai, Winson, Small, Rowland, Kell, 
and Goodacre 2002 Taylor, Goodacre, Wade, Rowland, and Kell 1998 



Vaidyanathan, Broadhurst, Kell, and Goodacre[ 2003), single nuclear poly- 



2004 Shah and Kusiak 



morphisms (Barrett 2003 Reif, White, and Moore 



2004), chest pain (Bojarczuk, Lopes, and Freitas 



and Von Zuben 2004 Eri ksson and Olsson 2004 Heidema, Boer, Nagelk- 
erke, Mariman, van der A, and Feskens| 2006 Ho, Hsieh, Chen, and Huang[ 



2000), and Affymetrix 



GeneChip microarray data (de Sousa, de C. T. Gomes, Bezerra, de Castro, 



2006[ |Hong and Cho| |2006| |Langdon and Buxton[ 2004[ Li, Jiang, Li, Moser, 



Guo, Du, Wang, T op ol, Wang, and R ao 2005j Lind en an d Blraya |2007| [Yu, 
Yu, Almal, Dhanasekaran, Ghosh, Worzel, and Chinnaiyan 2007). 
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Kell and his colleagues in Aberystwyth have had great success in applying 
GP widely in bioinformatics (see infrared spectra above and (Allen, Davey, 



Broac 


hurst, Heald, Rowland, Oliver, and Kell| 2003 |Day, Kell, and Griffith 


2002 


Gilbert, Goodacre, Woodward, and Kell, 


1997 Goodacre and Gilbert 


1999 


Jones, Young, Taylor, Kell, and Rowlanc 


1998 


Kell 2002a|b c| Kell, 


Darby, and Draper! 200l[ Shaw, Winson, Woodwarc 


, McGovern, Davey, 


Kaderbhai, Broadhurst, Gilbert, Taylor, Timmins, Goodacre, Kell, Alsberg, 



and Rowland 2000 Woodward, Gilbert, and Kell[ 1999)). Another very 



active group is that of Moore and his colleagues (Moore, Parker, Olsen, and 



Aune 2002| |Motsinger, Lee, Mellick, and Ritchie 2006 Ritchie, Motsinger, 
Bush, Coffey, and Moore[ 2007| Ritchie, White, Parker, Hahn, and Moore| 
20031). 



Computational chemistry is widely used in the drug industry. The prop- 
erties of simple molecules can be calculated. However, the interactions be- 
tween chemicals which might be used as drugs and medicinal targets within 
the body are beyond exact calculation. Therefore, there is great interest in 
the pharmaceutical industry in approximate in silico models which attempt 
to predict either favourable or adverse interactions between proto-drugs and 
biochemical molecules. Since these are computational models, they can be 
applied very cheaply in advance of the manufacturing of chemicals, to decide 
which of the myriad of chemicals might be worth further study. Potentially, 
such models can make a huge impact both in terms of money and time 
without being anywhere near 100% correct. Machine learning and GP have 



both been tried. GP approaches include (Bains, Gilbert, Sviridenko, Gas- 



con, Scoffin, Birchall, Harvey, and Caldwell 2002 


Barrett and Langdon 


2006; Buxton, Langdon, and Barrett, 200l[ 


Felton 


2000; Globus, Lawton, 


and Wipke, 1998; Goodacre, Vaidyanathan, Dunn, Harrigan, and Kell, 2004 j 


Harrigan et al.| 2004| (Hasan, Daugelat, Rao 


, and Sclrreiber[ 2006; Krasno- 


gor 2004 Si, Wang, Zhang, Hu, and Fan 2006; Venkatraman, Dalby, and 



Yang 2004 Weaver 2004). 



12.8 GP to Create Searchers and Solvers — 
Hyper-heuristics 

Hyper-heuristics could simply be defined as “heuristics to choose other 



heuristics” (Burke, Kendall, Newall, Hart, Ross, and Schulenburg 2003). 
A heuristic is considered as a rule-of-thumb or “educated guess” that re- 
duces the search required to find a solution. The difference between meta- 
heuristics and hyper-heuristics is that the former operate directly on the 
problem search space with the goal of finding optimal or near-optimal so- 
lutions. The latter, instead, operate on the heuristics search space (which 
consists of the heuristics used to solve the target problem). The goal then 
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is finding or generating high-quality heuristics for a problem, for a certain 
class of instances of a problem, or even for a particular instance. 

GP has been very successfully used as a hyper heuristic. For example, GP 



naga, 2002 Kibria and Li 


2006), state-of-the-art or better than state-of-the- 


art bin packing algorithms (Burke, Hyde, and Kendall 


2006f Burke, Hyde, 


Kendall, and Woodward 


2007 Poli, Woodward, and Burke 2007), particle 


swarm optimisers (Poli, Di Chio, and Langdon 


2005 


Poli, Langdon, and 


Holland 2005 ) , evolutionary algorithms 


'Oltean 


2005 ) , and travelling sales- 


man problem solvers (Keller and Poli 


2007a|b 


c Oltean and Dumitrescu 



2004). 



12.9 3/ 4 Entertainment and Computer Games 

Today, a major usage of computers is interactive games (Priesterjahn, 



Kramer, Weimer, and Goebels 2006). There has been some work on in- 



corporating artificial intelligence into mainstream commercial games. The 
software owners are not keen on explaining exactly how much AI they use 
or giving away sensitive information on how they use AI. Work on GP and 



games includes (Azaria and Sipper 2005a Langdon and Poli 2005 Vowk, 
Wait, and Schmidt 2004|) as well as the human-competitive game players 
mentioned in Section 12. 3| page |120| Funes reports experiments which at- 
tracted thousands of people via the Internet who were entertained by evolved 



Tron players (Funes, Sklar, Juille, and Pollack 1998b). 



Since 2004, the annual IEEE CEC conference has included sessions on 
evolutionary computation in games. After chairing the IEEE Symposium 
on Computational Intelligence and Games 2005, at Essex University, Si- 
mon Lucas founded the IEEE Computational Intelligence Society’s Techni- 
cal Committee on Games. GP features heavily in the Games TC’s activi- 
ties having being applied to Othello, poker, backgammon, draughts, chess, 
Ms Pac-Man, robotic football and radio controlled model car racing. 



12.10 The Arts 

Computers have long been used to create purely aesthetic artifacts. Much 
of today’s computer art tends to ape traditional drawing and painting, pro- 
ducing static pictures on a computer monitor. However, the immediate 
advantage of the computer screen — movement — can also be exploited. In 
both cases evolutionary computation can and has been exploited. Indeed, 
with evolution’s capacity for unlimited variation, evolutionary computation 
offers the artist the scope to produce ever changing works. Some artists 
have also worked with sound. 
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The use of GP in computer art can be traced b ack at least to the work 
of Sims (Sims 1991) and Lathamp] Jacob’s work (Jacob 2000 20011 pro- 



P] Jaco 

l ('2006 



vides many examples. McCormack (2006) considers the recent state of play 



in evolutionary art and music. Many recent techniques are described in 

1999) has been dominated by 



(Machado and Romero 2008). 



Evolutionary music (Todd and Werner 



Jazz (Spector and Alpern 1994). An exception is Bach (Federman, Spark- 



man, and Watt 1999). Most approaches to evolving music have made at 
least some use of interactive evolution (Takagi 20011 in which the fitness 



of programs is provided by users, often via the Internet (Ando, Dahlsted, 
Nordahl, and Iba 2007 Chao and Forrest 20031. The limitation is al- 



most always finding enough people willing to participate (Langdon 20041. 



Costelloe and Ryan (20071 tried to reduce the human burden. Algorithmic 



2002 ). 



approaches are also possible (Cilibrasi, Vitanyi, and de Wolf 2004 Inagaki 



One of the sorrows of AI is that as soon as it works it stops being AI (and 
celebrated as such) and becomes computer engineering. For example, the 
use of computer generated images has recently become cost effective and is 
widely used in Hollywood. One of the standard state-of-the-art techniques 
is the use of Reynold’s swarming “boids” (Reynolds, 1987) to create ani- 



mations of large numbers of rapidly moving animals. This was first used in 
Cliffhanger (1993) to animate a cloud of bats. Its use is now commonplace 
(herds of wildebeest, schooling fish, and even large crowds of people). In 
1997 Reynold was awarded an Oscar. 

Since 2003, EvoMUSART (the European Workshop on Evolutionary Mu- 
sic and Art) has been held every year along with the EuroGP conference as 
part of the EvoStar event. 



12.11 Compression 



Koza ( 1992| ) was the first to use genetic programming to perform compres- 
sion. He considered, in particular, the lossy compression of images. The idea 
was to treat an image as a function of two variables (the row and column 
of each pixel) and to use GP to evolve a function that matches as closely as 
possible the original. One can then use the evolved GP tree as a lossy com- 
pressed version of the image, since it is possible to obtain the original image 
by evaluating the program at each row-column pair of interest. The tech- 
nique, which was termed programmatic compression , was tested on one small 
synthetic image with good success. Programmatic compression was further 



developed and applied to realistic data (images and sounds) by Nordin and 
Banzhaf ( 1996 ) . 



1 http : //www . williamlathaml . com/ 
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Iterated Functions System (IFS) are important in the domain of frac- 
tals and the fractal compression algorithm. Lutton, Levy-Vehel, Cretin, j 
Glevarec, and Roll ( 1995a| b) used genetic programming to solve the inverse 
problem of identifying a mixed IFS whose attractor is a specific binary (black 
and white) image of interest. The evolved program can then be taken to rep- 
resent the original image. In principle, this can then be further compressed. 
The technique is lossy, since rarely the inverse problem can be solved ex- 
actly. No practical application or compression ratio results were reported 
in (Lutton et al. |1995aj b). Using similar principles, |Sarafopoulos| ( |1999| ) 
used GP to evolve affine IFSs whose attractors represent a binary image 
containing a square (which was compressed exactly) and one containing a 
fern (which was achieved with some error in the finer details). 

Wavelets are frequently used in lossy image and signal compression. 



Klappenecker and May ( 1995 1 used GP to evolve wavelet compression algo- 



rithms (internal nodes represented conjugate quadrature filters, leaves rep- 
resented quantisers). Results on a small set of real-world images were im- 
pressive, with the GP compression outperforming JPEG at all compression 
ratios. 



The first lossless compression technique (Fukunaga and Stechert 1998) 



used GP to evolve non-linear predictors for images. These were used to 
predict the gray level a pixel will take based on the gray values of a subset 
of its neighbours (those that have already been computed in a row-by-row 
and column-by-column scan of the image array). The prediction errors to- 
gether with the model’s description represent a compressed version of the 
image. These were compressed using the Huffman encoding. Results on five 
images from the NASA Galileo Mission database were very promising with 
GP compression outperforming some of the best human-designed lossless 
compression algorithms. 

In many compression algorithms some form of pre-processing or transfor- 
mation of the original data is performed before compression. This often im- 



proves compression ratios. Parent and Nowe (2002) evolved pre-processors 



for image compression using GP. The objective of the pre-processor was to 
reduce losslessly the entropy in the original image. In tests with five images 
from the Canterbury corpus, GP was successful in significantly reducing the 
image entropy. As verified via the application of bzip2, the resulting images 
were markedly easier to compress. 



In (Krantz, Lindberg, Thorburn, and Nordin 2002) the use of program- 



matic compression was extended from images to natural videos. A program 
was evolved that generates intermediate frames of video sequence, where 
each frame is composed by a series of transformed regions from the adjacent 
frames. The results were encouraging in the sense that a good approxima- 
tion to frames was achieved. While a significant improvement in compres- 
sion was achieved, programmatic compression was very slow in comparison 
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with standard compression methods, the time needed for compression being 
measured in hours or even days. Acceleration in GP image compression was 
achieved in (He, Wang, Zhang, Wang, and Fang 20051, where an optimal 



linear predictive technique was proposed which used a less complex fitness 
function. 

Recently Kattan and Poli (2008) proposed a GP system called GP-ZIP 
for lossless data compression based on the idea of optimally combining well- 
known lossless compression algorithms. The file to be compressed was di- 
vided into chunks of a predefined length, and GP was asked to find the best 
possible compression algorithm for each chunk in such a way as to minimise 
the total length of the compressed file. The compression algorithms avail- 
able to GP-ZIP included arithmetic coding, Lempel-Ziv- Welch, unbounded 
prediction by partial matching, and run length encoding among others. Ex- 
perimentation showed that when the file to be compressed is composed of 
heterogeneous data fragments (as it is the case, for example, in archive files), 
GP-zip is capable of achieving compression ratios that are significantly su- 
perior to those obtained with other compression algorithms. 



Chapter 13 

Troubleshooting GP 



The dynamics of evolutionary algorithms (including GP ) are often very com- 
plex, and the behaviour of an EA is typically challenging to predict or un- 
derstand. As a result it is often difficult to troubleshoot such systems when 
they are not performing as expected. While we obviously cannot provide 
troubleshooting suggestions that are specific to every GP implementation 
and application, we can suggest some general issues to keep in mind. To a 



large extent the advice in (Kinnear 1994b Koza 1992 Langdon 
remains sound. 



1998) also 



13.1 Is there a Bug in the Code? 

Machine learning systems are notoriously difficult to protect from coding 
and logical mistakes. Unless a mistake produces a runtime error, it may 
remain hidden in a system for a long time and may contribute to the system 
achieving unsatisfactory results. Such mistakes are difficult to find because 
the system, being adaptive, will still work to some degree. This is also true 
of GP. 

The most common reaction to a system not producing satisfactory results 
is to start playing with the parameters, the fitness function, the primitive set, 
etc. However, one should also consider the possibility of a coding mistake. 
The normal program validation techniques, such as inspection of critical 
regions of code, should be used to ensure everything is alright. 

If the code is part of an established GP implementation, coding errors 
are less likelyQ A more probable source of coding errors is stretching the 
GP library beyond its original intended use. Reading the manual carefully 
is sometimes a good preventive cure for problems. 

1 Coding errors cannot be entirely excluded, though, especially if a GP library is large 
and provides a rich set of features and functionalities. 
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13.2 Can you Trust your Results? 



Since GP is a stochastic search algorithm, different runs may have different 
outcomes and yield different results. Because of this, one needs to be very 
careful in making inferences regarding the degree of success of the system 
from a small set of runs. 

It is possible, for example, to run a GP system 10 times on a particular 
problem, observe that all 10 runs failed to find a solution, and conclude that 
GP cannot solve the problem. However, if the success probability is say 5% 
with a particular choice of parameters and representation, the probability of 
doing 10 runs and all of them failing is almost 60%! So, the failure to solve 
the problem in these 10 runs should not come as a surprise, even though 
there’s a reasonable chance that you would find a solution if you did more 
runs. 

For precisely this reason, it is very important to do enough runs and 
use appropriate statistical tests to ensure that conclusions are statistically 
significant. 

GP runs can often be very time consuming, especially if the fitness func- 
tion is computationally expensive. While parallel and distributed computing 



(see Section 10.4) can significantly speed up the process, tools from the de- 



sign of experiments literature (Bartz-Beielstein 2006) can also be used to 



reduce the number of different runs that are necessary to explore the space 
in a statistically sound manner. 

A common GP application is classification, e.g., evolving a program or 
function that can classify patient biopsy data into two categories: cancerous 
or benign. There are numerous pitfalls in this type of work, such as using 
all the available data as training data, thereby leaving nothing to use for 
validating your evolved solution on unseen data. There is a broad literature 
on this and related subjects, and numerous tools such as cross-validation 
that one can use when not enough data are available. (See, for example, 



(Hastie, Tibshirani, and Friedman 2001!).) The aim must be to ensure that 



your results can be trusted to work in the real world, rather than in just the 
synthetic environment created by the fitness cases we chose. 



13.3 There are No Silver Bullets 

When working on real problems there are not likely to be any silver bullets. 
No technique (including GP) is likely to solve all instances of an NP-hard 
problem in an amount of time that grows linearly with the size of the prob- 
lem. GP has proven extremely successful in a wide variety of domains (e.g., 
Chapter [l2| ) but that doesn’t mean that it will work immediately or easily 
in every domain, or even that it is the best tool for a specific domain. 

While some of the successes in the field have been “easy” , most were the 
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result of significant effort by experienced practitioners. It is likely that for 
every GP approach that has successfully solved a problem, several others 
have failed. It is in the nature of academic publishing that one does not get 
to hear about failures. 

So, don’t expect immediate success, and don’t become too discouraged 
by poor early results. 



13.4 Small Changes can have Big Effects 

Don’t assume “a little fiddling” with parameters, operators, fitness func- 
tions, etc., is harmless. One of the awkward realities of many widely appli- 
cable tools is that they typically have numerous tunable parameters. Evo- 
lutionary algorithms such as GP are no exception. Often changing a pa- 
rameter or two can have a fairly minimal impact, and averaging over many 
runs is required to reliably detect those effects. Some parameter changes, 
however, can produce more dramatic effects. Changing the function set, for 
example, can significantly change the distribution of the sizes and shapes of 
trees, especially in the early generations, and potentially bias the system in 
unexpected ways. 

Another source of change can be the problem domain. A common mis- 
take is to hope that parameter settings that worked well for one problem 
will also work well for what appears to be a very similar problem. Problems 
that appear similar to humans, however, may have quite different search 
characteristics. 

In addition, there are many small differences in GP implementations that 
are rarely considered important or even reported. However, our experience is 
that they may produce significant changes in the behaviour of a GP system. 
Differences as small as an “>’ in place of a “>’ in an if statement can have 
an important effect. For example, the substitution “>’ <-> “>’ may influence 
the winners of tournaments, the designation of the best-of-run individual, 
the choice of which elements are cloned when elitism is used, or the offspring 
produced by operators which accept the offspring only if it is better or not 
worse than a parent. 



13.5 Big Changes can have No Effect 

When big changes appear to make little difference, this can sometimes be 
used to identify problems with the domain representation and fitness mea- 
sure. Alternatively it may be that the problem is simply too difficult, and 
no change is likely to make a significant difference. 

Suppose that you’re not making much progress during a set of runs. One 
might react by sweeping the parameter space, doing runs with a variety of 
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different parameter settings in the hope of finding a better collection of 
parameter values. What if changing the parameter values really does not 
have much impact? That may mean that GP just is not able to gain any 
traction given your current representation of the problem domain and fitness 
function. You might, therefore, reconsider how the problem is posed to GP. 
If the representation and fitness make the problem essentially a search for a 
needle in a haystack, then GP will mostly be lost searching through highly 
sub-optimal solutions. If so, altering parameter values is unlikely to help. 

Note that essentially the same symptoms are also observed if the problem 
is really beyond the capabilities of your computing resources. For example, 
if the solutions are exceptionally rare, unless there are nice fitness gradients 
guiding GP towards them, finding any solution will likely be beyond the 
capacity of current computer technology. 

How can one distinguish which is the cause of the lack of success? Is it a 
bad choice of representation and fitness or is it just an extremely hard prob- 
lem? To answer these questions, it is important to look at what happened 
when the population size was varied. Even in the absence of fitness guid- 
ance, GP will search. In fact, it will perform a sort of random exploration 
of the search space. It may not be a particularly rational exploration 
we know, for example, that GP with subtree crossover tends to oversample 
and re-sample short programs — yet, it is still a form of stochastic search. 
Thus, one may expect that, if the problem is solvable, as the population 
size is progressively increased, sooner or later we should start seeing some 
variation in the fitness of programs. This may be sufficient for evolution to 
work with, particularly if we help it by improving the representation and 
fitness function. If, instead, nothing interesting happened as the population 
size was increased, then perhaps you don’t have enough computing power 
to solve the problem as posed, or the problem has no solution. 

13.6 Study your Populations 

If you’re not getting your desired results, it is important to take the time 
to dig around in the populations and see what is actually being evolved)^] 
For example, if you’re using ADFs because you think that your problem 
would benefit from a modular solution, examine the individuals that you’re 
evolving. Are they using ADFs? (Sometimes the result producing branch 
simply will not refer to the ADFs at all.) Are they using them in a modular 
way? Are ADFs being used multiple times? Do the ADFs encapsulate some 
interesting logic, or are they just re-naming an input variable? If you’re 
using grammatical evolution, on the other hand, are your evolved individuals 
using your grammar as you expected? Or is the grammar in fact biasing 

2 If the system you’re using doesn’t allow you to dump individuals from a run, add 
that functionality or use a different system. 



13.6 Study your Populations 



135 



the system in an undesirable and unexpected way? Similar questions can be 
asked for almost any flavour of GP; think about your goals and expectations, 
and explore your populations to see to what degree those are being met. 

Similarly, it can be valuable to look at the way your population changes 
over time in more detail than that provided by the standard plot of fitness 
vs. time. You might look at the distribution of tree sizes during your run, 
or the distribution of fitness values. The distribution of fitness values might 
suggest things about the structure of the search space as seen by your GP 
system. If it seems to be dominated by disjoint values with large gaps 
between them, then jumping those gaps may be a major challenge for your 
system and it may be the cause for poor performance. 

While it is important to look inside your populations, the time and ef- 
fort required to do so is effectively a function of how much information is 
recorded. Computer algorithms can easily generate enormous amounts of 
data, especially if you produce a detailed log of events and individuals gener- 
ated during your runs. Consequently, processing those results may become a 
challenging data-mining exercise. Finding good ways to visualise those large 
data sets can be extremely valuable. While there are a handful of papers 



that specifically address visualisation, e.g., (Daida, Hilss, Ward, and Long 
2005 Pohlheim 1999 Yamashiro, Yoshikawa, and Furuhashi[ 20061, and 



even the occasional workshop ( Smith, Bullock, and Bird 2002 1 , most visu- 
alisation techniques are scattered through the literature and we are unaware 
of any comprehensive review. Where we can provide a bit more guidance is 
program visualisation. 

An obvious (but easy to forget) advantage of GP is that we create visible 
programs. This need not be the case with other approaches. So, when 
presenting GP results, as a matter of routine one should consider making 
a figure which contains the whole evolved program. The dot component of 
the Graphviz packag^Jcan be particularly helpful in this regard; Figure 6.2 



is an example of a tree diagram generated with a simple dot input file. The 
program lisp2dot|^]can help with the conversion from Lisp-style expressions 
to dot input files. 

As the evolved trees can often be very large, it is usually helpful to per- 
form at least some basic simplifications such as removing excess significant 
digits in constants and combining constant terms. Naturally, after clean- 
ing up the evolved program, one should make sure it still works; you should 
also clearly indicate in any presentation or write-up that the program you’re 
presenting has been cleaned and is not the actual tree generated by GP. 

There are methods to automatically simplify expressions (e.g., in Mathe- 
matica and Emacs). However, since in general there is an exponentially large 
number of equivalent expressions, automatic simplification is hard. Another 



a http : //www . graphviz . org/ 

4 http : //www . cs . ucl . ac . uk/staf f /W . Langdon/lisp2dot . html 
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Figure 13.1: Visualisation of the size and shape of the entire population of 
1,000 individuals in the final generation of runs using a depth limit of 50 (on 
the left) and a size limit of 600 (on the right). The inner circle is at depth 



50, and the outer circle is at depth 100. These plots are from (Crane and 
McPhee[ 2005) and were drawn using the techniques described in (Daida 



et al. 2005). 



way is to use GP as a multi-objective evolutionary algorithm (cf. Chapter]!)]) 
In some cases the details of the trees are less important than their general 
size and shape. Daida et al. (20051 presented a particularly useful set of 
visualisation techniques for this situation^] These techniques allow one to 
see the size and shape of both individual trees as well as an aggregate view 
of entire populations. Figure 13. 1| for example, shows the impact of size and 
depth limits on the size and shape of trees in two different runs with very 
similar average sizes and depths. The plots make it clear, however, that the 
shapes of the resulting trees were quite different. 



13.7 Encourage Diversity 

One important property to keep an eye on is population diversity. Two 
particular measures that can be useful sources of information are: 

Frequency of primitives Recognising when a primitive has been com- 
pletely lost from the population (or its frequency has fallen to a low 
level, consistent with the mutation rate) may help to diagnose prob- 
lems. 

5 A Mathematica implementation of this technique can be downloaded from http: 
//library .wolf ram . com/inf ocenter/MathSource/5163/ 
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Population variety If the variety — the number of distinct individuals in 
the population — falls below 90% of the population size, this may in- 
dicate that there is a problem. However, a high variety does not mean 
the reverse. GP populations often contain introns (Section 11.31, and 
so programs which are not identical may behave identically (|Gustafson 



2004 McPhee and Hopper 1999). Being different, these individuals 



contribute to a high variety. So, a high variety need not indicate all 
is well. Measuring phenotypic variation (i.e., diversity of behaviour) 
may also be useful (McPhee, Ohs, and Hutchison] [2008| . 



Insufficient diversity may cause significant problems. Panmicticj^] steady- 
state populations with tournament selection, reproduction and crossover, for 
example, are prone to premature convergence. If you find this to be an issue, 
measures should be taken to encourage population diversity such as: 
























Not using the reproduction operator. 

Adding one or more mutation operators. 

Using a weaker selection mechanism, e.g., using smaller tournament 
sizes. 

Using uniform random selection (instead of the standard negative tour- 
naments) to decide which individuals to remove from the population]^] 

Using a generational population model instead of a steady-state model. 

Splitting large populations into semi-isolated denies (Section 10.51 

Using fitness sharing to encourage the formation of many fitness niches. 



13.8 Embrace Approximation 

There is a widespread belief that computer programs are fragile and that 
any change to any bit in them will cause them to stop working. This is 
fostered by the common knowledge that a small typing mistake by a human 
programmer can sometimes introduce a troublesome bug into a program. 

6 In a panmictic population no mating restrictions are imposed as to which individual 
mates with which. 

7 Doing this means that the selection scheme is no longer elitist, and it may be worth- 
while to protect the best individual(s) to preserve the elitism. 

8 What is meant by a “large population” has changed over time. In the early days 
of GP, populations of 1,000 or more could be considered large. However, CPU speeds 
and computer memory have increased exponentially over time. So, at the time of writing 
it is not unusual to see populations of hundred of thousands or millions of individuals 
being used in the solution of hard problems. Research indicates that there are benefits in 
splitting populations into demes even for much smaller populations. See Section 110.51 
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Programmers know from painful experience, however, that far from proving 
immediately fatal, errors can lay hidden for years. Further, not all errors 
are created equal. Some are indeed critical and must be dealt with immedi- 
ately, while others are rare or largely inconsequential and so never become a 
major priority. The worst are arguably the severe bugs that rarely express 
themselves, as they can be extremely difficult to pin down yet still have dire 
consequences when they appear. 

In summary, there is no such thing as a perfect (non-trivial) human- 
written program and all such programs include a variety of errors of different 
severity and with a different frequency of manifestation p] 

This sort of variability is also very common in GP work. It provides the 
sort of toehold that evolution can exploit in the early generations of GP 
runs. The population of programs just needs to contain a few which move 
vaguely in the right direction. Many of their offspring may be totally blind or 
have no legs, just so long as a few continue to slime towards the light. Over 
generations evolution may hopefully cobble together some useful features 
from this initially unpromising ooze. The results, however, are unlikely 
to be perfect or pretty. If you as a GP engineer insist on only accepting 
solutions that are beautifully symmetric and walk on two legs on day one, 
you are likely to be disappointed. As we have argued above, even human- 
written programs often only approximate their intended functionality. So, 
why should we not accept the same from GP? 

If you accept this notion, then it is important to provide your system with 
some sort of gradient upon which to act, allowing it to evolve ever better 
approximations. It is also important to ensure that your test environment 
(usually encapsulated in the fitness function) places appropriate emphasis on 
the most important features of the space from a user perspective. Consider a 
problem with five test cases, four of which are fairly easy and consequently 
less important, with the fifth being crucial and quite difficult. A likely 
outcome in such a setting is that individuals that can do the four easier 
tasks, but are unable to make the jump to the fifth. There are several 
things you could try: 1) weighting the hard task more heavily, 2) dividing 
it up in some way into additional sub-tasks, or 3) changing it from being a 
binary condition (meaning that an individual does or does not succeed on the 
fifth task) to a continuous condition, so that an individual GP program can 
partially succeed on the fifth task. The first of these options is the simplest 
to implement. The second two, however, create a smoother gradient for the 
evolutionary process to follow, and so may yield better results. 



9 This is, of course, no excuse for writing shoddy, bug-ridden code. 
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13.9 Control Bloat 



If you are running out of memory or your execution times seem inordinately 
high, look at how your average program size is changing over time. If pro- 
grams are growing extremely fast, you may want to implement some form 



of bloat control (see Section 11.3 1 . Naturally, long runs may simply be the 



result of the population being very large or the fitness evaluation being slow. 
In these cases, you may find the techniques described in Chapter [T0| helpful. 

Controlling bloat is also important if your goal is to find a comprehensible 
model, since in practice smaller models are easier to understand. A large 
model will not only be difficult to understand but also may over-fit the 
training data (Geliy, Teytaud, Bredeche, and Schoenauer 2006). 



13.10 Checkpoint Results 

Where GP run time is long, it is important to periodically save the current 
state of the run. Should the system crash, the run can be restarted from 
part way through rather than at the start. Care should be taken to save the 
entire state, so restarting a run does not introduce any unknown variation. 
The bulk of the state to be saved is the current population. This can be 
compressed, e.g., using gzip. While compression can add a few percent 
to run time, reductions in disk space to less than one bit per primitive 
in the population have been achieved. Checkpointing also allows you to 
later continue runs that seemed particularly promising when they reached 
whatever maximum generation you set initially. 



13.11 Report Well 

There are many potential reasons why work may be poorly received. Here 
are a few: insufficient explanation of methods and algorithms, insufficient 
experimental evidence, insufficient analysis, lack of statistical significance, 
lack of replicability, reading too much into one’s results, insufficient novelty, 
poor presentation and poor English. In scientific, rather than commercial, 
work it is vital to report enough details so that someone else can reproduce 
your results. One very useful idea is to publish a table summarising your 
GP run. Table |4~T| (page [TTT]) contains an example tableau. 

As explained in Section |13.2[ it is essential to ensure that results are 
statistically significant so that nobody can dismiss them as the consequence 
of a lucky fluke. Complex ideas are often best explained by diagrams. When 
possible, descriptions of non-trivial algorithms should be accompanied by 
pseudocode, along with text describing the most important components of 
the algorithm. 
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13 Troubleshooting GP 



In addition to reporting your results, make sure you also discuss their 
implications. If, for example, what GP has evolved means the customer can 
save money or could improve their process in some way, then this should 
be highlighted. Also be careful to not construct excessively complex expla- 
nations for the observations. It is very tempting to say “X is probably due 
to Y”, but for this to be believable one should at least have made some 
attempt to check if Y is indeed taking place, and whether modulations or 
suppression of Y in fact produce modulations and/or suppression of X. 

Finally, the most likely outcomes of a text that is badly written or badly 
presented are: 1) your readers will misunderstand you, and 2) you will have 
fewer readers. Spell checkers can help with typos, but whenever possible 
one should ensure a native English speaker has proofread the text. 

13.12 Convince your Customers 

For any work in science, engineering, industry or commerce to make an im- 
pact it must be presented in a form that can convince others of the validity 
of its results and conclusions. This might include: a pitch within a corpo- 
ration seeking continued financial support for a project, the submission of 
a research paper to a journal or the presentation of a GP-based product to 
potential customers. 

The burden of proof is on the users of GP, and it is important to use the 
customer’s language. If the fact that GP discovers a particular chemical is 
important in a reaction or drug design, for example, one should make this 
stand out during the presentation. A great advantage of GP over many AI 
techniques in that its results are often simple equations. Ensure these are 
intelligible to your customer, e.g., by simplification. Also make an effort to 
present your results using your customer’s terminology. Your GP system 
may produce answers as trees, but if the customers use spreadsheets, con- 
sider translating the tree into a spreadsheet formula. Alternatively, your 
customer may not be particularly interested in the details of the solution, 
but instead care a great deal about which inputs the evolutionary process 
tended to use. 

Also, one should try to discover how the customers intend to validate 
GP’s answer. Do not let them invent some totally new data which has 
nothing to do with the data they supplied for training ( “just to see how well 
it does...”). Avoid customers with contrived data: GP is not omnipotent 
and knows nothing about things it has not seen. At the same time you 
should be scrupulous about your own use of holdout data. GP is a very 
powerful machine learning technique, and with this comes the ever present 
danger of over-fitting. One should never allow performance on data reserved 
for validation to be used to choose which answer to present to the customer. 





Chapter 14 

Conclusions 



In his seminal paper entitled “Intelligent Machinery”, Turing (1948) identi- 



fied three ways by which human-competitive machine intelligence might be 
achieved. In connection with one of those ways, Turing said: 

There is the genetical or evolutionary search by which a com- 
bination of genes is looked for, the criterion being the survival 



value. (Turing 19481 



Turing did not specify how to conduct the “genetical or evolutionary 
search” for machine intelligence. In particular, he did not mention the idea of 
a population-based parallel search in conjunction with sexual recombination 
(crossover) as described in Holland’s 1975 book Adaptation in Natural and 

second edition). However, in Turing’s 



1992 



Artificial Systems (Holland 
paper “Computing Machinery and Intelligence” ( Turing 1950 ) , he did point 
out: 



We cannot expect to find a good child-machine at the first at- 
tempt. One must experiment with teaching one such machine 
and see how well it learns. One can then try another and see 
if it is better or worse. There is an obvious connection between 
this process and evolution: 



‘Structure of the child machine’ = Hereditary material 
‘Changes of the child machine’ = Mutations 

‘Natural selection’ = Judgement of the experimenter 

In other words, Turing perceived that one possibly productive approach 
to machine intelligence would involve an evolutionary process in which a 
description of a computer program (the hereditary material) undergoes pro- 
gressive modification (mutation) under the guidance of natural selection 
(that is, selective pressure in the form of what we now call “fitness”). 
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14 Conclusions 



Today, decades later, we can see that indeed Turing was right. GP has 
started fulfilling his dream by providing us with a systematic method, based 
on Darwinian evolution, for getting computers to automatically solve hard 
real-life problems. To do so, it simply requires a high-level statement of 
what needs to be done and enough computing power. 

Turing also understood the need to evaluate objectively the behaviour ex- 
hibited by machines, to avoid human biases when assessing their intelligence. 
This led him to propose an imitation game, now known as the Turing test for 
machine intelligence , whose goals are wonderfully summarised by Samuel’s 
position statement quoted in the introduction of this book (page ll. The 
eight criteria for human competitiveness we discussed in Section 12. 3| are 
essentially motivated by the same goals. 

At present GP is unable to produce computer programs that would pass 
the full Turing test for machine intelligence, and it might not be ready 
for this immense task for centuries. Nonetheless, thanks to the constant 
improvements in GP technology, in its theoretical foundations and in com- 
puting power, GP has been able to solve dozens of difficult problems with 
human-competitive results and to provide valuable solutions to many other 
problems (see Chapter [l2| . These are a small step towards fulfilling Turing 
and Samuel’s dreams, but they are also early signs of things to come. It is 
reasonable to predict that in a few years time GP will be able to routinely 
and competently solve important problems for us, in a variety of application 
domains with human-competitive performance. Genetic programming will 
then become an essential collaborator for many human activities. This will 
be a remarkable step forward towards achieving true human-competitive 
machine intelligence. 

This field guide is an attempt to chart the terrain of techniques and 
applications we have encountered in our journey in the world of genetic 
programming. Much is still unmapped and undiscovered. We hope this 
book will make it easier for other travellers to start many long and profitable 
journeys in this exciting world. 



If you have found this book to be useful, please feel free to redistribute it 
(see pagepip. Should you want to cite this book, please refer to the entry for 



(Poli et al. 2008) in the bibliography. 



Part IV 



Tricks of the Trade 




In the end we find that Mary does indeed have a little GP. . . 
and the wolf is shown to have a very large bibliography. 



143 




Appendix A 

Resources 



The field of GP took off in the early 1990 ’s, driven in significant part by 
the publication of (Koza 1992). Those early days were characterised by the 
exponential growth common in the initial stages of successful technologies. 
Many influential papers from that period can be found in the proceedings 
of the International Conference on Genetic Algorithms (ICGA-93, ICGA- 
95), the IEEE conferences on Evolutionary Computation (EC-1994), and 
the Evolutionary Programming conferences. A surprisingly large number 
of these are now available on-line, and we’ve included as many URLs as 
we could in the bibliography^] After almost twenty years, GP has matured 
and is used in a wondrous array of applications from banking to betting, 
from bomb detection to architectural design, from the steel industry to the 
environment, from space to biology, and many others (as we have seen in 
Section 12 1 . 

In 1996 it was possible to list almost all the studies and applications of 
GP (Langdon 1996), but today the range is far too great. In this appendix 
we will review some of the wide variety of available sources on GP which 
should assist readers who wish to explore further. Consulting information 
available on the Web is certainly a good way to get quick answers for someone 
who wants to know what GP is. These answers, however, will often be too 
shallow for someone who really wants to then apply GP to solve practical 
problems. People in this position should probably invest some time going 
through more detailed accounts; some of the key books in the field include 



(Banzhaf, Nordin, Keller, and Francone 


1998a 


Koza 


1992 


Langdon and 


Poli 2002), and others are listed in Section 


A.l 


Technical papers in the 



extensive GP literature may be the next stage. Although this literature is 



easily accessible thanks to the complete on-line bibliography ( Langdon et al. 



1995-2008), newcomers will often need to be selective in what they read. The 



1 Each included URL was tested and was operational at the time of writing. 
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objective here may be different for different types of readers. Practitioners 
may wish to focus initially on papers which deal with the same problem they 
are interested in. Researchers and PhD students interested in developing a 
deeper understanding of GP should also make sure they identify and read as 
many seminal papers as possible, including papers or books on empirical and 
theoretical studies on the inner mechanisms and behaviour of GP. These 
are frequently cited in other papers and, so, can be easily identified. 



A.l Key Books 



There are today more than 31 books written in English principally on GP 
or its applications with more being written. These start with John Koza’s 
Genetic Programming 1992 (often referred to as Jaws). Koza has subse- 
quently published three additional books on GP: Genetic Programming II: 
Automatic Discovery of Reusable Programs (1994) deals with ADFs; Ge- 
netic Programming 3 (1999) covers, in particular, the evolution of analogue 
circuits; Genetic Programming 4 (2003) uses GP for automatic invention. 
MIT Press published three volumes in the series Advances in Genetic Pro- 



gramming ( 


Angeline and Kinnear 1996 1 Kinnear 1994c 


Spector, Langdon, 


O’Reilly, and Angeline, 


1999 


). The joint GP / genetic algorithms Kluwer 



book. series edited by Koza and Goldberg now contains 14 books starting 
with Genetic Programming and Data Structures (Langdon 1998). Apart 
from Jaws, these tend to be for the GP specialist. The late 1990s saw the 



introduction of the first textbook dedicated to GP (Banzhaf et al. 1998a). 



Eiben and Smith ( 2003 ) and Goldberg ( 1989 ) provide general treatments of 



evolutionary algorithms. 

Other titles include: Genetic Programming (in Japanese) (|Iba 1996bj), 



Principia Evolvica - Simulierte Evolution mit Mathematica (in German) 
(Jacob 1997| ) (English version (Jacob 2001 1), Data Mining Using Gram- 
mar Based Genetic Programming and Applications (Wong and Leung, 2000), 
Grammatical Evolution: Evolutionary Automatic Programming in a Arbi- 
trary Language ( O’Neill and Ryan) 20031, Humanoider: Sjavlarande robotar 
och artificiell intelligens (in Swedish) (Nordin and Johanna 2003), and Lin- 
ear Genetic Programming (|Brameier and Banzhaf 20071. 



Readers interested in mathematical and empirical analyses of GP be- 



haviour may find Foundations of Genetic Programming ( Langdon and Poli 



2002) useful. 



Each of Koza’s four books has an accompanying video. These videos are 
now available in DVD format. Also, a small set of videos on specific GP 
techniques and applications is available via on-line resources such as Google 
Video and YouTube. 



A. 2 Key Journals 
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A. 2 Key Journals 



In addition to GP’s own Genetic Programming and Evolvable Machines jour- 
nal, Evolutionary Computation , the IEEE transaction on Evolutionary Com- 
putation, Complex Systems (Complex Systems Publication, Inc.), the new 
Journal on Artificial Evolution and Applications and many others publish 



GP articles. The GP bibliography (Langdon et al. 1995-2008) lists a further 



375 different journals worldwide that have published articles related to GP. 



A. 3 Key International Meetings 

EuroGP - the European Conference on Genetic Programming - has been 
held every year since 1998. All EuroGP papers are available on line as part of 
Springer’s LNCS series. The original annual Genetic Programming confer- 
ence ran for three years (1996-1998) before combining in 1999 with the Inter- 
national Conference on Genetic Algorithms (ICGA) to form GECCO. 98% 
of GECCO papers are available on-line. The Michigan-based Genetic Pro- 



gramming Theory and Practice workshop (O’Reilly, Yu, Riolo, and Worzel 



2004| Riolo and Worzel) 2003; Riolo, Soule, and Worzel, 2007a Yu, Riolo, 


and Worzel 


2005) has recently published its fifth proceedings (Riolo, Soule, 


and Worzel) 


2007b 


). Other EC conferences, such as CEC, PPSN, Evolution 



Artificielle and WSC, also regularly contain GP papers. 



A. 4 GP Implementations 

One of the reasons behind the success of GP is that it is easy to implement 
own versions, and implementing a simple GP system from scratch remains 
an excellent way to make sure one really understands the mechanics of GP. 
In addition to being an exceptionally useful exercise, it is often easier to 
customise (e.g., adding new, application specific genetic operators or imple- 
menting unusual, knowledge-based initialisation strategies) a system one has 
built for new purposes than a large GP distribution. All of this, however, 
requires reasonable programming skills and the will to thoroughly test the 
resulting system until it behaves as expected. 

This is actually an extremely tricky issue in highly stochastic systems 
such as GP, as we discussed in Section |13.1| The problem is that almost 
any system will produce “interesting” behaviour, but it is typically very 
hard to test whether it is exhibiting the correct interesting behaviour. It 
is remarkably easy for small mistakes to go unnoticed for extended periods 
of time (even years) |^] It is also easy to incorrectly assume that “minor” 



2 Several years ago Nic and some of his students discovered that one of their systems 
had been performing addition instead of subtraction for several months due to a copy- 
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implementation decisions will not significantly affect the behaviour of the 
system (see Section 13.4). 

An alternative is to use one of the many public domain GP implemen- 
tations and adapt this for one’s purposes. This process is faster, and good 
implementations are often robust, efficient, well documented and compre- 
hensive. The small price to pay is the need to study the available documen- 
tation and examples. These often explain how to modify the GP system 
to some extent. However, deeper modifications (such as the introduction 
of new or unusual operators) will often require studying the actual source 
code and a substantial amount of trial and error. Good publicly avail- 



able GP implementations include: Lil-GP ( 


Punch and Zongker 


1998), ECJ 


(Luke, Panait, Balan, Paus, Skolicki, Popovici, Harrison, Bassett, Hubley, 


and Chircop 2000-2007), Open Beagle (Gagne and Parizeau 


2002) and 


GPC++ ( 
mercial im 


Fraser and Weinbrenner 1993-1997). The most prominent com- 


plementation remains Discipulus 


RML Technologies 


1998-2007); 



see (Foster 2001 1 for a review. A number of older unsupported tools can be 
found at ftp://cs.ucl.ac.uk/genetic/ftp.io.com/ 

While the earliest GP systems were implemented in Lisp, people have 
since coded GP in a huge range of different languages, including C/C++, 
Java (see an example in Appendix | b|), JavaScript, Perl, Prolog, Mathemat- 
ica, Pop-11, MATLAB, Fortran, Occam and Haskell. Typically, these evolve 
expressions and programs which look like simplified Lisp. More complex tar- 
get languages can be supported, however, especially with the use of more 
advanced tools such as grammars and type systems (see Chapter [6]). Con- 
versely, many successful programs in machine code or low-level languages 
have also climbed from the primordial ooze of initial randomness. 



A. 5 On-Line Resources 



On-line resources appear, disappear, and move with great speed, so all the 
addresses here (and elsewhere in the book), which were correct at the time 
of writing, are obviously subject to change without notice after publication. 
Hopefully, the most valuable resources should be readily findable using stan- 
dard search tools. 

One of the key on-line resources is the GP bibliography (|Langdon et al. 



1995-2008) At the time of writing, this bibliography contains about 5,000 



GP entries, roughly half of which can be downloaded immediately)^ 



paste error. Fortunately no published results were affected, but it was a very unsettling 
experience. 

"http: //www. cs .bham. ac .uk/~wbl/biblio/ 

4 The GP bibliography is a volunteer effort and depends crucially on submissions from 
users. Authors are encouraged to check that their GP publications are listed, and send 
missing entries to the bibliography’s maintainers. 
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The GP bibliography has a variety of interfaces, including a graphical 
representation of GP’s collaborative network (see Figure A.l I. The bibliog- 
raphy allows for quick jumps between papers linked by authors and allows 
one to sort the author list by the number of GP publications. Full refer- 
ences are provided in both BiBTj^X and Refer formats for direct inclusion 
in papers written in IATj^X and Microsoft Word, respectively. The GP bib- 
liography is also part of the Collection of Computer Sciences Bibliographies 



(Achilles and Ortyl 1995-20081, which provides a comprehensive Lucerne 



syntax search engine. 

From early on there has been an active, open email discussion list: the 



Genetic Programming mailing list (2001-20081. The EC-Digest (1985-20081 



is a moderated list covering evolutionary computation more broadly, and 
often contains GP related announcements. 

Koza’s http://www.genetic-programming.org/ contains a ton of use- 
ful information for the novice, including a short tutorial on “What is Genetic 
Programming” and the Lisp implementation of GP from Genetic Program- 
ming (Koza 1992). 
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Figure A.l: Co-authorship connections within GP. Each of the 1,141 dots 
indicates an author, and edges link people who have co-authored one or 
more GP papers. (To reduce clutter only links to first authors are shown.) 
The size of each dot indicates the number of entries. The on-line version is 
annotated using JavaScript and contains hyperlinks to authors and their 
GP papers. The graph was created by GraphViz twopi, which tries to 
place strongly connected people close together. This diagram displays just 
the “centrally connected component” (Tomassini et al. 2007) and contains 



approximately half of all GP papers. The remaining papers are not linked 
by co-authorship to this graph. Several other large components are also 



available on-line via the GP Bibliography (Langdon et al. 1995-2008). 



Appendix B 



TinyGP 



TinyGFQ i s a highly optimised GP system that was originally developed to 
meet the specifications set out in the TinyGP competition of the Genetic and 
Evolutionary Computation Conference (GECCO) 2004. We include it as a 
working example of a real GP system, to show that GP software tools are 
not necessarily big, complex and difficult to understand. The system can be 
used as is or can be modified or extended for a user’s specific applications. 
Furthermore, TinyGP may serve as a guide to other implementations of 
genetic programming. 

The following section provides a description of the main characteristics 
of TinyGP. Section [B . 2| describes the format for the input files for TinyGP. 
Section |B.3| provides further details on the implementation and the source 
code for a Java version of TinyGP. Finally, Section |B.4| describes a sample 
run of the system. 

There are numerous other GP systems available on-line. See Section |A.4| 
for a discussion of some of the options. 

B.l Overview of TinyGP 

TinyGP is a symbolic regression system with the following characteristics: 

1. The terminal set includes a user-definable number of floating point 
variables (named XI to XN). 

2. The function set includes multiplication, protected division, subtrac- 
tion and addition. 

3. The fitness cases are read from a file (the format is given below). 



J http : //cswww. essex . ac.uk/ staf f /rpoli/TinyGP/ 
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4. The system is steady state. A “generation” is considered concluded 
when POPSIZE (see below) crossover/mutation events have been per- 
formed. 

5. Selection is performed using tournament selection. 

6. Negative tournaments are used for the selection of the individuals to 
be replaced at each steady-state-GP iteration. 

7. Subtree crossover is used. The selection of crossover points is uniform, 
so every node is chosen equally likely. 

8. Point mutation is used. That is, points (nodes) in the tree are ran- 
domly chosen. If a point is a terminal, then it is replaced by another 
randomly chosen terminal. If it is a function, then it is replaced by 
another randomly chosen function with the same number of inputs. 

9. The following parameters are implemented as static class variables: 

• The maximum length any GP program can take: MAX_LEN. 

• The size of the population: POPSIZE. 

• The maximum depth initial programs can have: DEPTH. Note 0 
represents the depth of programs containing just one terminal. 

• The maximum number of generations allowed for a run: 

GENERATIONS. 

• The probability of creating new individuals via 

crossover: CR0SS0VER_PR0B. The mutation probability is 

1 - CR0SS0VER_PR0B. 

• The mutation probability (per node) when point mutation is cho- 
sen as the variation operator: PMUT_PER_N0DE. 

• The tournament size: TSIZE. 

10. The parameters and the random seed are printed when each run starts. 

11. The fitness function is minus the sum of the absolute differences be- 
tween the actual program output and the desired output for each fit- 
ness case. TinyGP maximises it. 

12. The grow initialisation method is used to create the initial population. 

13. At each generation the following statistics are calculated and printed: 

• The generation number. 

• The average fitness of the individuals in the population. 

• The fitness of the best individual in the population. 
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• The average size of the programs in the current generation. 

• The best individual in the population. 

14. The random number generator can be seeded via the command line. 
If this command line parameter is absent, the system uses the current 
time to seed the random number generator. 

15. The name of the file containing the fitness cases can be passed to 
the system via the command line. If the command line parameter is 
absent, the system assumes the data are stored in the current directory 
in a file called “problem.dat”. 

16. If the total error made by the best program goes below 10 -5 TinyGP 
prints a message indicating success and stops. If the problem has 
not been solved after the maximum number of generations, it prints a 
message indicating failure and stops. 



B.2 Input Data Files for TinyGP 

The input files for TinyGP have the following plain ASCII format: 
HEADER. // See below 

FITNESSCASE1 // The fitness cases (one per line) 

FITNESSCASE2 

FITNESSCASE3 



Each fitness case is of the form 
XI ... XN TARGET 

where XI to XN represent a set of input values for a program, while 
TARGET represents the desired output for the given inputs. 

The header has the following entries 

NVAR NRAND MINRAND MAXRAND NFITCASES 

where NVAR is an integer representing the number of variables the system 
should use, NRAND is an integer representing the number of random con- 
stants to be provided in the primitive set, MINRAND is a float representing 
the lower limit of the range used to generate random constants, MAXRAND 
is the corresponding upper limit, and NFITCASES is an integer represent- 
ing the number of fitness cases. NRAND can be set to 0, in which case 
MINRAND and MAXRAND are ignored. For example: 



1 100 -5 5 63 
0.0 0 
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0.1 0.0998334166468282 
0.2 0.198669330795061 
0.3 0.29552020666134 

55 LINES OMITTED 

5.9 -0.373876664830236 

6.0 -0.279415498198926 

6.1 -0.182162504272095 

6.2 -0.0830894028174964 

These fitness cases are sin(x) for x £ {0.0, 0.1, 0.2, . . . 6.2} 



B.3 Source Code 

The original TinyGP system was implemented, in the C programming lan- 
guage, to maximise efficiency and minimise the size of the executable j^| The 
version presented here is a Java re-implementation of TinyGP. The original 
version did not allow the use of random numerical constants. 

How does TinyGP work? The system is based on the standard flattened 
(linear) representation for trees, which effectively corresponds to listing the 
primitives in prefix notation but without any brackets. Each primitive occu- 
pies one byte. A program is simply a vector of characters. The parameters 
of the system are as specified in Section |B.1| They are fixed at compile time 
through a series of static class variable assignments. The operators used 
are subtree crossover and point mutation. The selection of the crossover 
points is performed at random with uniform probability. The primitive set 
and fitness function are as indicated above. The code uses recursion for the 
creation of the initial population (grow) , for the identification of the subtree 
rooted at a particular crossover point (traverse), for program execution 
(run), and for printing programs (print_indiv). A small number of global 
variables have been used. For example, the variable program is a program 
counter used during the recursive interpretation of programs, which is auto- 
matically incremented every time a primitive is evaluated. Although using 
global variables is normally considered bad programming practice, this was 
done purposely, after extensive experimentation, to reduce the executable’s 
size. 



2 The C version of TinyGP is probably the world’s smallest tree-based symbolic- 
regression GP system. The source code, in C, is 5,906 bytes. The original version included 
a compilation script which, with a variety of tricks, created a self-extracting executable 
occupying 2,871 bytes (while the actual size of the executable after self-extraction was 
4,540 bytes). All optimisations in the code were aimed at bringing the executable size 
(as opposed to the source code size) down, the main purpose being to show that, against 
popular belief, it is possible to have really tiny and efficient GP systems. 
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The code reads command line arguments using the standard args array. 

Generally the code is quite standard and should be self-explanatory for 
anyone who can program in Java, whether or not they have implemented a 
GP system before. Therefore very few comments have been provided in the 
source code. 

The source is provided below. 

/* 

* Program: tiny _gp . java 

* 

* Author: Riccardo Poli (email: rpoli@essex.ac.uk) 

* 

*/ 

import java, util .*; 

import java . io . * ; 

import java . text . DecimalFormat ; 

public class tiny.gp { 
double [] fitness; 
char [ ] [ ] pop ; 

static Random rd = new Random () ; 
static final int 
ADD = 110, 

SUB =111, 

MUL =112, 

DIV = 113, 

FSET.START = ADD, 

FSET_END = DIV; 

static double [] x = new double [FSET_ST ART ] ; 
static double minrandom , maxrandom ; 
static char [] program; 
static int PC; 

static int varnumber , fitnesscases , randomnumber ; 
static double fbestpop = 0.0, favgpop = 0.0; 
static long seed; 
static double avg_len ; 
static final int 
MAXXEN = 10000, 

POPSIZE = 100000, 

DEPTH = 5, 

GENERATIONS = 100, 

TSIZE = 2; 

public static final double 

PMUTJPERJNODE = 0.05, 

CROSSOVERPROB = 0.9; 
static double [][] targets; 

double run ( ) { /* Interpreter */ 
char primitive = program [PC++] ; 
if ( primitive < FSET .START ) 
return (x [ primitive ] ) ; 
switch ( primitive ) { 
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case ADD : return( run() + run() ); 

case SUB : return ( run() — run() ); 

case MUL : return( run() * run() ); 

case DIV : { 

double num = run() , den = run() ; 
if ( Math.abs( den ) <= 0.001 ) 
return ( num ) ; 
else 

return ( num / den ) ; 

} 

} 

return ( 0.0 ); // should never get here 

} 

int traverse ( char [] buffer , int buffercount ) { 
if ( buffer [buffercount ] < FSET_START ) 
return( +- |-buffercount ); 

switch ( b u ffer [ buffercount ] ) { 
case ADD: 
case SUB: 
case MUL: 
case DIV : 

return( traverse ( buffer , traverse ( buffer , +- l-buffercount 

) ) ); 

} 

return ( 0 ) ; // should never get here 

} 



void set up_fitness ( String fname) { 
try { 

int i , j ; 

String line ; 

BufferedReader in = 
new BufferedReader ( 

new 

FileReader (fname) ) ; 
line = in . readLine ( ) ; 

St ringTokenizer tokens = new StringTokenizer ( line ) ; 
varnumber = Int eger . parselnt ( tokens . nextToken (). trim ()) ; 
randomnumber = Int eger . parselnt ( tokens . nextToken (). trim () ) 



88 minrandom = Double . parseDouble ( tokens . nextToken () . 

trim ( ) ) ; 

89 maxrandom = Double . parseDouble ( tokens . nextToken (). trim () ) 

90 fitnesscases = Int eger . parselnt ( tokens . nextToken (). trim () ) 



91 targets = new double [ fit nesscases ][ varnumber + 1] ; 

92 if (varnumber + randomnumber >= FSET_START ) 

93 System .out. println(” too ^many^ variables ^and^ const ants” ) ; 

94 
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for (i = 0; i < fitnesscases; i +- K ) { 
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line = in . readLine ( ) ; 

tokens = new S t ringTokenizer ( 1 i ne ) ; 

for (j = 0; j <= varnumber ; j++) { 

t ar get s [ i ] [ j ] = Double . parseDouble ( tokens . nextToken () . 
trim ( ) ) ; 

} 

} 

in . close () ; 

} 

catch ( FileNotFoundException e) { 

System . out . println ( ’’ERROR: ^Please^provide^a^data^file”) ; 
System . exit (0) ; 

} 

catch ( Except ion e ) { 

System . out . println (’’ERROR: - Incorrect ^data^format” ) ; 

System . exit (0) ; 

} 

} 



double fit ness _funct ion ( char [] Prog ) { 
int i = 0 , len ; 
double result , fit = 0.0; 

len = traverse ( Prog, 0 ); 
for (i = 0; i < fit nesscases ; i ++ ) { 
for (int j = 0; j < varnumber; j +- K ) 

X[j] = targets [ i ] [ j ] ; 
program = Prog ; 

PC = 0; 

result = run () ; 

f i t += Math . abs ( result — targets [i][ varnumber ] ) ; 

} 

return(— fit ) ; 



int grow ( char [] buffer , int pos , int max, int depth ) { 

char prim = (char) rd . next Int (2) ; 

if ( pos >= max ) 
return ( —1 ) ; 

if ( pos = 0 ) 
prim = 1 ; 

if ( prim = 0 | | depth = 0 ) { 

prim = (char) rd . next Int ( varnumber + randomnumber) ; 
buffer [pos] = prim; 
return (pos + 1) ; 

} 

else { 

prim = (char) ( rd . next Int (FSET.END - FSET_START + 1) + 
FSET_START) ; 
switch(prim) { 
case ADD: 
case SUB: 
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case MUL: 
case DIV : 

buffer [ pos ] = prim; 

returnf grow ( buffer , grow ( buffer , pos + 1, max , depth — 1) , 
max, depth— 1 ) ); 

} 

} 

return ( 0 ) ; // should never get here 



int print_indiv( char [] buffer , int buffercount er ) { 
int al =0, a2 ; 

if ( buffer [ buffercount er ] < FSET_START ) { 
if ( buffer [ buffercount er ] < varnumber ) 

System . out . print ( ”X”+ ( buffer [ buffercount er ] + 1 ) + ” 

); 

else 

System .out. print ( x[buffer[buffercounter]]) ; 
return( +- l-buffercount er ); 

} 

switch ( buffer [ buffercount er ] ) { 

case ADD: System . out . print ( 

al=pr int _indiv ( buffer , +- l-buffercounter ); 

System . out . print ( ” ) ; 

break ; 

case SUB: System . out . print ( 

al=print _indiv ( buffer , +- l-buffercounter ); 

System . out . print ( ” V’ ) ; 

break ; 

case MUL: System . out . print ( 

al=p r i n t _i nd i v ( buffer , +- l-buffercounter ); 

System . out . print ( 

break ; 

case DIV: System . out . print ( 

al=p r i n t _i nd i v ( buffer , +- l-buffercounter ); 

System . out . print ( ” ) ; 

break : 

} 

a2=print _indi v ( buffer , al ) ; 

System . out . print ( ” ; 

return ( a2 ) ; 



static char [] buffer = new char [MAX_LEN] ; 
char [] create_random_indi v ( int depth ) { 
char [ ] ind ; 
int len ; 

len = grow( buffer , 0, MAX_LEN, depth ); 
while (len < 0 ) 

len = grow ( buffer , 0, MAXJLEN, depth ); 
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202 ind = new char[len]; 

203 

204 System . arraycopy ( buffer , 0 , ind, 0 , len ); 

205 return ( ind ); 

206 } 

207 

208 char [][] create_random_pop ( int n, int depth, double [ 

fitness ) { 

209 char [ ] [ ] pop = new char [ n ] [ ] ; 

210 int i ; 

211 

212 for ( i = 0 ; i < n; i +- b ) { 

213 pop [ i ] = creat e_random_indiv ( depth ); 

214 fitness[i] = fit ness.function ( pop [ i ] ); 

215 } 

216 return ( pop ); 

217 } 

218 

219 

220 void stats ( double [] fitness , char [] [] pop, int gen ) { 

221 int i , best = rd . nextlnt (POPSIZE) ; 

222 int node_count = 0 ; 

223 fbestpop = fit ness [ best ] ; 

224 favgpop = 0 . 0 ; 

225 

226 for ( i = 0 ; i < POPSIZE; i +- K ) { 

227 node_count 4= traverse ( pop [ i ] , 0 ); 

228 favgpop += fitness [i]; 

229 if ( fitness [i] > fbestpop ) { 

230 best = i ; 

231 fbestpop = fitness [i]; 

232 } 

233 } 

234 avg_len = (double) node.count / POPSIZE; 

235 favgpop /= POPSIZE; 

236 System . out . print (” Generation=”+gen+” -Avg- Fit ness=” +(— favgpop 

) + 

237 ” - Best -Fitness=”+(— fbestpop )+” -Avg- Size=” + 

avg_len+ 

238 ” \nBest -Individual :-”) ; 

239 print_indiv( pop [best], 0 ); 

240 System . out . print ( ”\n”); 

241 System . out . flush () ; 

242 } 

243 

244 int tournament ( double [] fitness , int tsize ) { 

245 int best = rd . nextlnt (POPSIZE) , i , competitor ; 

246 double fbest = — 1 . 0 e 34 ; 

247 

248 for ( i = 0 ; i < tsize ; i +- K ) { 

249 competitor = rd . nextlnt (POPSIZE) ; 

250 if ( fitness [competitor] > fbest ) { 

251 fbest = fit ness [ competitor ] ; 

252 best = competitor ; 

253 } 
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} 

return ( best ) ; 



int negative.tournament ( double [] fitness , int tsize ) { 
int worst = rd . nextlnt (POPSIZE) , i , competitor ; 
double fworst = le34; 

for ( i = 0; i < tsize ; i +- K ) { 
competitor = rd . nextlnt (POPSIZE) ; 
if ( fitness [competitor] < fworst ) { 
fworst = f i t n e s s [ competitor ] ; 
worst = competitor ; 

} 

} 

return ( worst ); 

} 

char [] crossover ( char [] parentl , char [] parent2 ) { 
int xolstart , xolend , xo2start , xo2end ; 
char [] offspring ; 

int lenl = traverse ( parentl , 0 ) ; 
int len2 = traverse ( parent2 , 0 ); 
int lenoff ; 

xolstart = rd . next Int ( len 1 ) ; 

xolend = traverse ( parentl, xolstart ); 

xo2start = rd . next Int ( len2 ) ; 

xo2end = traverse ( parent2 , xo2start ); 

lenoff = xolstart + (xo2end — xo2start) + ( lenl —xolend ) ; 
offspring = new char [ lenoff ] ; 

System . arraycopy ( parentl, 0, offspring, 0, xolstart ); 
System . arraycopy ( parent2 , xo2start , offspring , xolstart , 
(xo2end — xo2start) ); 

System . arraycopy ( parentl , xolend, offspring , 
xolstart + (xo2end — xo2start) , 

( lenl —xolend ) ); 

return( offspring ); 

} 

char [] mutation ( char [] parent , double pmut ) { 
int len = traverse ( parent , 0 ) , i ; 
int mutsite ; 

char [] parentcopy = new char [len]; 

System . arraycopy ( parent, 0, parentcopy, 0, len ); 
for (i = 0; i < len; i ++ ) { 

if ( rd . nextDouble ( ) < pmut ) { 
mutsite = i ; 

if ( parentcopy [ mutsite ] < FSET_START ) 
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parentcopy [ mutsite ] = (char) rd . next Int ( varnumber ) ; 
else 

switch ( parentcopy [ mutsite ] ) { 

case ADD: 

case SUB: 

case MUL: 

case DIV : 

parentcopy [ mutsite ] = 

(char) (rd . nextlnt (FSET.END - FSETJ3TART + 1) 
+ FSET.START) ; 

} 

} 

} 

return ( parentcopy ); 



void print_parms () { 

System . out . print ( ” — „TINY„GP„ ( Java version ) „ \n” ) ; 

System . out . print ( ”SEED=”+seed+” \nMA X T, EN=”+MA X TE N+ 
” \ nPOPSIZE=” +POPSIZE+” \nDEPTH=” +DEPTH+ 

” \nCROSSOVER_PROB=” -t-CROSSOVEFtPROBt- 
” \nPMUTJ 3 ERJ^ODE=”+PMUTJ :> ERNODEf 
” \nMIN_FtANDOM=”+minrandom+ 

” \nMAXRANDOM=” +maxrandom+ 

” \nGENERATIONS=”+GENERATIONSf 
” \nTSIZE=”+TSIZE+ 



public tiny_gp ( String fname , long s ) { 
fitness = new double [POPSIZE ] ; 
seed = s ; 
if ( seed >= 0 ) 

rd.setSeed(seed) ; 
setup.fitness (fname) ; 

pop = create_random_pop (POPSIZE, DEPTH, fitness ); 
for ( int i = 0; i < FSET_START| i ++ ) 

x [ i]= ( maxrandom— minrandom ) *rd . nextDouble ()+minrandom ; 

} 



void evolve ( ) { 

int gen = 0, indivs , offspring , parentl , parent2 , parent ; 

double newfit ; 

char [ ] newind ; 

print _parms ( ) ; 

stats( fitness , pop, 0 ); 

for ( gen = 1; gen < GENERATIONS; gen ++ ) { 
if ( fbestpop > — le— 5 ) { 

System . out . print ( ”PROBLEM„SOLVED\n” ) ; 

System . exi t ( 0 ) ; 

} 

for ( indivs = 0; indivs < POPSIZE; indivs ++ ) { 
if ( rd. nextDouble () > CROSSOVERPROB ) { 
parentl = tournament ( fitness , TSIZE ); 
parent2 = tournament ( fitness , TSIZE ); 
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364 newind = crossover ( pop [ parent 1 ], pop [ parent 2 ] ); 

365 } 

366 else { 

367 parent = tournament ( fitness , TSIZE ); 

368 newind = mutation ( pop [parent], PMUT_PER_NODE ); 

369 } 

370 newfit = fit ness_funct ion ( newind ); 

371 offspring = negative-tournament ( fitness , TSIZE ); 

372 pop [ offspring ] = newind; 

373 fitness [offspring] = newfit; 

374 } 

375 stats ( fitness , pop, gen ); 

376 } 

377 System . out . print (” PROBLEM^ *NOT* J30LVED\n” ) ; 

378 System, exit ( 1 ); 

379 } 

380 

381 public static void main ( St ring [ ] args) { 

382 String fname = ’’problem.dat”; 

383 long s = —1; 

384 

385 if ( args. length = 2 ) { 

386 s = Integer. valueOf(args[0]).intValue(); 

387 fname = args[l]; 

388 } 

389 if ( args. length = 1 ) { 

390 fname = args[0]; 

391 } 

392 

393 tiny_gp gp = new t iny_gp ( fname , s); 

394 gp . evolve ( ) ; 

395 } 

396 } ; 



B.4 Compiling and Running TinyGP 

It is very common, nowadays, for people to write and execute code within 
some development environment. Each has its own way of doing these oper- 
ations, but the process is typically very straightforward. 

If one wants to compile TinyGP from the operating system’s shell, this 
can be done by issuing the command javac -0 tiny_gp . java. This applies 
to both Unix and Windows users. Windows users will have to click on 
Start— > Run and then issue the command cmd to launch a shell. Of course, 
if the javac Java compiler and/or the tiny_gp. java source file are not in 
the current directory/folder, then full path names must be provided when 
issuing the compilation command. 

If the dataset is stored in a file problem.dat, the program can then 
simply be launched with the command java tiny_gp. Otherwise, the user 
can specify a different datafile on the command line, by giving the command 
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java tiny_gp FILE, where FILE is the dataset file name (which can include 
the full path to the file). Finally, the user can specify both the datafile and 
a seed for the random number generator on the command line, by giving 
the command java tiny_gp SEED FILE, where SEED is an integer. 

As an example, we ran TinyGP on the sin(a;) dataset described in Sec- 
tion pT2| (which is available at http://cswww.essex.ac.uk/staff/rpoli/ 
TinyGP/sin-data.txt). The output produced by the program was some- 
thing like the following 

— TINY GP (Java version) — 

SEED=— 1 
MA X T,E N= 10000 
POPSIZE= 100000 
DEPTH=5 

CROSSOVERIPROB = 0 . 9 
PMUT_PEIGNODE= 0.05 
MIN JtANDOM= -5.0 
MAXPANDOM=5.0 
GENERATIONS= 100 
TSIZE=2 



Generation=0 Avg Fitness =42.53760218120066 Best Fitness 
= 39.997953686554816 Avg Size=10.9804 
Best Individual: (1.589816334458055 / -2.128280559500907) 
Generation=l Avg F it ness = 1 226 . 40441 5960088 Best Fitness 
= 24.441994244449372 Avg Size =10.97024 
Best Individual: (((-0.3839867944222206 / -2.2796127162428403)+ 
(-1.8386812853617673 / -1.06553859601892)) - 
(((4.984026635222818 * (0.17196413319878445 - 
0.1294044215655923)) + (XI - -1.8956001614031734)) * 
0.3627020733460027) ) 



The flip-o-rama animation in the footer of the bibliography and index 
include plots of the best and mean fitness, the mean program size and the 
behaviour of the best-so-far individual at each generation. The animation 
should be viewed by rapidly flipping the pages of the book from the begin- 
ning of the bibliography onward. For convenience, the plots corresponding 
to the final generation are also reported in Figure [B.l| 

As one can see, GP progressively evolves better and better approxima- 
tions to the sine function. The best individual at the end of the run had an 
error of 1.88. Its unsimplified version as produced by the system is 

(XI / ((-2.766097899954383 * (XI / (((XI / ((((XI / (XI * 

-3.2001163763204445)) * XI ) - -3.2001163763204445) * 
-3.2001163763204445)) + XI ) + (XI * (XI - 
3.9532436938954376))))) - (((XI * XI ) / (((XI / 

(3.9532436938954376 * -3.2001163763204445)) * XI ) - 
-3.2001163763204445)) / (((XI + XI ) / (XI * XI )) + XI )) 

)) 
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Figure B.l: Final generation of a TinyGP sample run: best and mean 
fitness (top), mean program size (middle) and behaviour of the best-so-far 
individual (bottom). 
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which can be simplified to 

x 

—axx x? 

sSe + x +x x(x-c) (| +a; )x(d-^) 

where 

a = 2.76609789995 
b = 10.240744822 
c = 3.9532436939 
d = 3.20011637632 
e = 12.6508398844 

Hardly an obvious approximation for the sine function, but still a very ac- 
curate oue, at least over the test range. 
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This book was primarily written using the document preparation 

system, along with BibTj^X, pdf latex and makeindex. Most of the editing 
was done using the eiacs and xemacs editors, along with extensions such 
as Reflex; some was done with T^XBhop as well. Most of the data plots 
were generated using gnuplot and the R statistics package. Diagrams were 
generated with a variety of tools, including the Graphviz package, tgif and 
xfig. A whole host of programming and scripting languages were used to 
automate various processes in both the initial scientific research and in the 
production of this book; they are too numerous to list here, but were crucial 
nonetheless. The cover was created with Adobe Photoshop^ and gimp. 

Coordinating the work of three busy, opinionated authors is not trivial, 
and would have been much more difficult without the use of revision control 
systems such as Subversion. Around 500 commits were made in a six month 
period, averaging around 10 commits per day in the final weeks. The actual 
files were hosted as a project at http://assembla.com; we didn’t realise 
until several months into the project that Assembla’s president is in fact 
Andy Singleton, who did some cool early work in GP in the mid-90’s. 

The “reviews” and “summaries” on the back cover were generated 
stochastically using the idea of N-grams from linguistics. For the “reviews” 
we collected a number of reviews of previous books on GP and EAs, and 
tabulated the frequency of different triples of adjacent words. These fre- 
quencies of triples in the source text were then used to guide the choices of 
words in the generated “reviews”. The only word following the pair “ad” 
and “hoc” in our source reviews, for example, was “tweaks”; thus once “ad” 
and “hoc” had been chosen, the next word had to be “tweaks”. The pair 
“of the”, on the other hand, appears numerous times in our source text, fol- 
lowed by words such as “field”, “body”, and “rapidly”. However, “theory” 
is the most common successor, and, therefore, the most likely to be cho- 
sen to follow “of the” in the generation of new text. The generation of the 
“summaries” was similar, but based on the front matter of the book itself. 



See (Poli and McPhee 2008a) for an application of these ideas in genetic 



programming. 



1 Adobe Photoshop is a registered trademark of Adobe Systems Incorporated 



